2022-2023 fall quarter, CS269 Seminar 5: Reinforcement Learning. Department of Computer Science at University of California, Los Angeles. Course Instructor: Professor Bolei ZHOU. Assignment author: Zhenghao PENG.
| Student Name | Student ID |
|---|---|
| Yingqi Gao | 705435843 |
Welcome to Assignment 2 of our RL course. This assignment consists of the following parts:
Section 0 and Section 1 set up the dependencies and prepare some useful functions.
The experiments we'll conduct and their expected goals:
Please install the following dependencies.
MetaDrive is a lightweight driving simulator that we will use for DQN and policy gradient methods. It cannot run on M1-chip Macs. We suggest using Colab or Linux to run MetaDrive.
Please ignore this warning from MetaDrive: WARNING:root:BaseEngine is not launched, fail to sync seed to engine!
We have several cells used for installing dependencies for Colab only. Please make sure they are run properly.
You don't need to reinstall Python packages after restarting the runtime, since the Colab instance remembers the Python environment after you install packages for the first time. But you do need to rerun the package-installation scripts after you reconnect to the runtime (which means Google has assigned a new machine to you, so the Python environment is fresh).
!pip install "gym[classic_control,box2d]<0.20.0" seaborn pandas
!pip install torch
Requirement already satisfied: gym[box2d,classic_control]<0.20.0 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (0.19.0) Requirement already satisfied: seaborn in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (0.12.1) Requirement already satisfied: pandas in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (1.3.5) Requirement already satisfied: numpy>=1.18.0 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from gym[box2d,classic_control]<0.20.0) (1.21.6) Requirement already satisfied: cloudpickle<1.7.0,>=1.2.0 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from gym[box2d,classic_control]<0.20.0) (1.6.0) Requirement already satisfied: pyglet>=1.4.0 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from gym[box2d,classic_control]<0.20.0) (1.5.27) Requirement already satisfied: box2d-py~=2.3.5 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from gym[box2d,classic_control]<0.20.0) (2.3.8) Requirement already satisfied: typing_extensions in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from seaborn) (4.4.0) Requirement already satisfied: matplotlib!=3.6.1,>=3.1 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from seaborn) (3.5.3) Requirement already satisfied: pytz>=2017.3 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from pandas) (2022.6) Requirement already satisfied: python-dateutil>=2.7.3 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from pandas) (2.8.2) Requirement already satisfied: fonttools>=4.22.0 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (4.38.0) Requirement already satisfied: cycler>=0.10 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (0.11.0) Requirement already satisfied: pillow>=6.2.0 in 
/Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (9.3.0) Requirement already satisfied: kiwisolver>=1.0.1 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (1.4.4) Requirement already satisfied: packaging>=20.0 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (21.3) Requirement already satisfied: pyparsing>=2.2.1 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (3.0.9) Requirement already satisfied: six>=1.5 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas) (1.16.0) Requirement already satisfied: torch in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (1.13.0) Requirement already satisfied: typing-extensions in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from torch) (4.4.0)
# Install MetaDrive, a lightweight driving simulator
!pip install git+https://github.com/metadriverse/metadrive
# Test whether MetaDrive is properly installed. No error means the test is passed.
!python -m metadrive.examples.profile_metadrive --num-steps 1000
Collecting git+https://github.com/metadriverse/metadrive Cloning https://github.com/metadriverse/metadrive to /private/var/folders/qn/ktplt3rn673_xx4m99jn41hw0000gn/T/pip-req-build-hxj8ihos Running command git clone --filter=blob:none --quiet https://github.com/metadriverse/metadrive /private/var/folders/qn/ktplt3rn673_xx4m99jn41hw0000gn/T/pip-req-build-hxj8ihos Resolved https://github.com/metadriverse/metadrive to commit 0f8579c305d3d1a27e35fe494f02d42eabec92fc Preparing metadata (setup.py) ... done Requirement already satisfied: gym==0.19.0 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from metadrive-simulator==0.2.5.2) (0.19.0) Requirement already satisfied: numpy in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from metadrive-simulator==0.2.5.2) (1.21.6) Requirement already satisfied: matplotlib in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from metadrive-simulator==0.2.5.2) (3.5.3) Requirement already satisfied: pandas in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from metadrive-simulator==0.2.5.2) (1.3.5) Requirement already satisfied: pygame in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from metadrive-simulator==0.2.5.2) (2.1.2) Requirement already satisfied: tqdm in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from metadrive-simulator==0.2.5.2) (4.64.1) Requirement already satisfied: yapf in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from metadrive-simulator==0.2.5.2) (0.32.0) Requirement already satisfied: seaborn in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from metadrive-simulator==0.2.5.2) (0.12.1) Requirement already satisfied: panda3d==1.10.8 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from metadrive-simulator==0.2.5.2) (1.10.8) Requirement already satisfied: panda3d-gltf in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from 
metadrive-simulator==0.2.5.2) (0.13) Requirement already satisfied: panda3d-simplepbr in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from metadrive-simulator==0.2.5.2) (0.10) Requirement already satisfied: pillow in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from metadrive-simulator==0.2.5.2) (9.3.0) Requirement already satisfied: pytest in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from metadrive-simulator==0.2.5.2) (7.2.0) Requirement already satisfied: opencv-python-headless in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from metadrive-simulator==0.2.5.2) (4.6.0.66) Requirement already satisfied: lxml in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from metadrive-simulator==0.2.5.2) (4.9.1) Requirement already satisfied: scipy in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from metadrive-simulator==0.2.5.2) (1.7.3) Requirement already satisfied: cloudpickle<1.7.0,>=1.2.0 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from gym==0.19.0->metadrive-simulator==0.2.5.2) (1.6.0) Requirement already satisfied: cycler>=0.10 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from matplotlib->metadrive-simulator==0.2.5.2) (0.11.0) Requirement already satisfied: kiwisolver>=1.0.1 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from matplotlib->metadrive-simulator==0.2.5.2) (1.4.4) Requirement already satisfied: pyparsing>=2.2.1 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from matplotlib->metadrive-simulator==0.2.5.2) (3.0.9) Requirement already satisfied: python-dateutil>=2.7 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from matplotlib->metadrive-simulator==0.2.5.2) (2.8.2) Requirement already satisfied: fonttools>=4.22.0 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from matplotlib->metadrive-simulator==0.2.5.2) 
(4.38.0) Requirement already satisfied: packaging>=20.0 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from matplotlib->metadrive-simulator==0.2.5.2) (21.3) Requirement already satisfied: pytz>=2017.3 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from pandas->metadrive-simulator==0.2.5.2) (2022.6) Requirement already satisfied: exceptiongroup>=1.0.0rc8 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from pytest->metadrive-simulator==0.2.5.2) (1.0.1) Requirement already satisfied: pluggy<2.0,>=0.12 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from pytest->metadrive-simulator==0.2.5.2) (1.0.0) Requirement already satisfied: iniconfig in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from pytest->metadrive-simulator==0.2.5.2) (1.1.1) Requirement already satisfied: attrs>=19.2.0 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from pytest->metadrive-simulator==0.2.5.2) (22.1.0) Requirement already satisfied: importlib-metadata>=0.12 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from pytest->metadrive-simulator==0.2.5.2) (5.0.0) Requirement already satisfied: tomli>=1.0.0 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from pytest->metadrive-simulator==0.2.5.2) (2.0.1) Requirement already satisfied: typing_extensions in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from seaborn->metadrive-simulator==0.2.5.2) (4.4.0) Requirement already satisfied: zipp>=0.5 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from importlib-metadata>=0.12->pytest->metadrive-simulator==0.2.5.2) (3.9.0) Requirement already satisfied: six>=1.5 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (from python-dateutil>=2.7->matplotlib->metadrive-simulator==0.2.5.2) (1.16.0) Successfully registered the following environments: ['MetaDrive-validation-v0', 
'MetaDrive-10env-v0', 'MetaDrive-100envs-v0', 'MetaDrive-1000envs-v0', 'SafeMetaDrive-validation-v0', 'SafeMetaDrive-10env-v0', 'SafeMetaDrive-100envs-v0', 'SafeMetaDrive-1000envs-v0', 'MARLTollgate-v0', 'MARLBottleneck-v0', 'MARLRoundabout-v0', 'MARLIntersection-v0', 'MARLParkingLot-v0', 'MARLMetaDrive-v0']. Start to profile the efficiency of MetaDrive with 1000 maps and ~8 vehicles! Finish 100/1000 simulation steps. Time elapse: 0.3066. Average FPS: 326.1803, Average number of vehicles: 5.5000 Finish 200/1000 simulation steps. Time elapse: 0.5918. Average FPS: 337.9601, Average number of vehicles: 6.3333 Finish 300/1000 simulation steps. Time elapse: 0.9368. Average FPS: 320.2486, Average number of vehicles: 6.2500 Finish 400/1000 simulation steps. Time elapse: 1.6150. Average FPS: 247.6849, Average number of vehicles: 8.0000 Finish 500/1000 simulation steps. Time elapse: 1.9934. Average FPS: 250.8231, Average number of vehicles: 7.5714 Finish 600/1000 simulation steps. Time elapse: 2.4431. Average FPS: 245.5924, Average number of vehicles: 8.1250 Finish 700/1000 simulation steps. Time elapse: 2.9372. Average FPS: 238.3212, Average number of vehicles: 8.2222 Finish 800/1000 simulation steps. Time elapse: 3.5457. Average FPS: 225.6227, Average number of vehicles: 8.6000 Finish 900/1000 simulation steps. Time elapse: 4.0943. Average FPS: 219.8190, Average number of vehicles: 8.8182 Finish 1000/1000 simulation steps. Time elapse: 4.7230. Average FPS: 211.7300, Average number of vehicles: 9.1667 Total Time Elapse: 4.723, average FPS: 211.727, average number of vehicles: 9.167.
# If you are using Colab, please run the following script EACH time you disconnect from a Runtime.
!apt-get install -y xvfb python-opengl
!pip install pyvirtualdisplay
zsh:1: command not found: apt-get Requirement already satisfied: pyvirtualdisplay in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (3.0)
# Update (2022-11-03): Fix pyglet compatibility issue since pyglet was recently updated to 2.0.0.
!pip install "pyglet<2.0.0"
Requirement already satisfied: pyglet<2.0.0 in /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages (1.5.27)
# If you are using Colab, please run the following script EACH time you restart the Runtime.
import os
os.environ['SDL_VIDEODRIVER']='dummy'
from pyvirtualdisplay import Display
display = Display(visible=0, size=(400, 300))
display.start()
# Run this cell without modification
# Import some packages that we need to use
import gym
import numpy as np
import pandas as pd
import seaborn as sns
from collections import deque
import copy
from gym.error import Error
from gym import logger, error
import torch
import torch.nn as nn
import time
from IPython.display import clear_output
from gym.envs.registration import register
import copy
import json
import os
import subprocess
import tempfile
import time
import IPython
import PIL
import pygame
def wait(sleep=0.2):
    clear_output(wait=True)
    time.sleep(sleep)
def merge_config(new_config, old_config):
    """Merge the user-defined config with the default config"""
    config = copy.deepcopy(old_config)
    if new_config is not None:
        config.update(new_config)
    return config
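For instance, merge_config overrides only the keys the user provides and leaves the defaults untouched (the config keys below are illustrative, not the assignment's actual defaults):

```python
import copy

def merge_config(new_config, old_config):
    """Merge the user-defined config with the default config."""
    config = copy.deepcopy(old_config)
    if new_config is not None:
        config.update(new_config)
    return config

defaults = {"lr": 0.01, "gamma": 0.8}
merged = merge_config({"lr": 0.1}, defaults)

# User-provided keys win; missing keys fall back to the defaults.
assert merged == {"lr": 0.1, "gamma": 0.8}
# deepcopy ensures the default dict itself is never mutated.
assert defaults == {"lr": 0.01, "gamma": 0.8}
```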
def test_random_policy(policy, env):
    _acts = set()
    for i in range(1000):
        act = policy(0)
        _acts.add(act)
        assert env.action_space.contains(act), "Out of the bound!"
    if len(_acts) != 1:
        print(
            "[HINT] Though we call self.policy a 'random policy', "
            "we find that generating an action randomly at the beginning "
            "and then fixing it during the value-updating period leads to better "
            "performance. A purely random policy does not even work! "
            "We encourage you to investigate this issue."
        )
# We register a non-slippery version of the FrozenLake environment.
try:
    register(
        id='FrozenLakeNotSlippery-v1',
        entry_point='gym.envs.toy_text:FrozenLakeEnv',
        kwargs={'map_name': '4x4', 'is_slippery': False},
        max_episode_steps=200,
        reward_threshold=0.78,  # optimum = .8196
    )
except Error:
    print("The environment is registered already.")
def _render_helper(env, mode, sleep=0.1):
    ret = env.render(mode)
    if sleep:
        wait(sleep=sleep)
    return ret
def animate(img_array):
    """A function that generates a GIF file and shows it in the notebook."""
    path = tempfile.mkstemp(suffix=".gif")[1]
    images = [PIL.Image.fromarray(frame) for frame in img_array]
    images[0].save(
        path,
        save_all=True,
        append_images=images[1:],
        duration=0.05,
        loop=0
    )
    with open(path, "rb") as f:
        IPython.display.display(
            IPython.display.Image(data=f.read(), format='png'))
def evaluate(policy, num_episodes=1, seed=0, env_name='FrozenLake8x8-v1',
             render=None, existing_env=None, max_episode_length=1000,
             sleep=0.0, verbose=False):
    """This function evaluates the given policy and returns the mean episode
    reward.

    :param policy: a function whose input is the observation
    :param num_episodes: number of episodes you wish to run
    :param seed: the random seed
    :param env_name: the name of the environment
    :param render: a flag indicating whether to render the policy
    :return: the averaged episode reward of the given policy.
    """
    if existing_env is None:
        env = gym.make(env_name)
        env.seed(seed)
    else:
        env = existing_env
    rewards = []
    frames = []
    if render:
        num_episodes = 1
    for i in range(num_episodes):
        obs = env.reset()
        act = policy(obs)
        ep_reward = 0
        for step_count in range(max_episode_length):
            obs, reward, done, info = env.step(act)
            act = policy(obs)
            ep_reward += reward
            if verbose and step_count % 50 == 0:
                print("Evaluating {}/{} episodes. We are in {}/{} steps. "
                      "Current episode reward: {:.3f}".format(
                          i + 1, num_episodes, step_count + 1,
                          max_episode_length, ep_reward))
            if render:
                frames.append(_render_helper(env, render, sleep))
                wait(sleep=0.05)
            if done:
                break
        rewards.append(ep_reward)
    if render:
        env.close()
    return np.mean(rewards), {"frames": frames}
The environment is registered already.
# Run this cell without modification
DEFAULT_CONFIG = dict(
    seed=0,
    max_iteration=20000,
    max_episode_length=200,
    evaluate_interval=10,
    evaluate_num_episodes=10,
    learning_rate=0.01,
    gamma=0.8,
    eps=0.3,
    env_name='FrozenLakeNotSlippery-v1'
)
class AbstractTrainer:
    """This is the abstract class for value-based RL trainers. We will inherit
    each specific algorithm's trainer from this abstract class so that we can
    reuse the code.
    """
    def __init__(self, config):
        self.config = merge_config(config, DEFAULT_CONFIG)

        # Create the environment
        self.env_name = self.config['env_name']
        self.env = gym.make(self.env_name)

        # Apply the random seed
        self.seed = self.config["seed"]
        np.random.seed(self.seed)
        self.env.seed(self.seed)

        # We set self.obs_dim to the number of possible observations
        # if the observation space is discrete, and otherwise to the number
        # of observation dimensions. The same goes for self.act_dim.
        if isinstance(self.env.observation_space, gym.spaces.box.Box):
            assert len(self.env.observation_space.shape) == 1
            self.obs_dim = self.env.observation_space.shape[0]
            self.discrete_obs = False
        elif isinstance(self.env.observation_space,
                        gym.spaces.discrete.Discrete):
            self.obs_dim = self.env.observation_space.n
            self.discrete_obs = True
        else:
            raise ValueError("Wrong observation space!")

        if isinstance(self.env.action_space, gym.spaces.box.Box):
            assert len(self.env.action_space.shape) == 1
            self.act_dim = self.env.action_space.shape[0]
        elif isinstance(self.env.action_space, gym.spaces.discrete.Discrete):
            self.act_dim = self.env.action_space.n
        elif isinstance(self.env.action_space, gym.spaces.MultiDiscrete):
            pass  # MultiDiscrete action space (used by e.g. MetaDrive-Tut-Easy-v0)
        else:
            raise ValueError("Wrong action space! {}".format(self.env.action_space))

        self.eps = self.config['eps']
    def process_state(self, state):
        """Process the raw observation. For example, we can use this function
        to convert the input state (an integer) to a one-hot vector.
        """
        return state

    def compute_action(self, processed_state, eps=None):
        """Compute the action given the processed state."""
        raise NotImplementedError(
            "You need to override the Trainer.compute_action() function.")

    def evaluate(self, num_episodes=50, *args, **kwargs):
        """Use the function you wrote to evaluate the current policy.
        Return the mean episode reward over `num_episodes` episodes."""
        if "MetaDrive" in self.env_name:
            kwargs["existing_env"] = self.env
        result, eval_infos = evaluate(self.policy, num_episodes, seed=self.seed,
                                      env_name=self.env_name, *args, **kwargs)
        return result, eval_infos

    def policy(self, raw_state, eps=0.0):
        """A wrapper function that takes raw_state as input and outputs an action."""
        return self.compute_action(self.process_state(raw_state), eps=eps)

    def train(self):
        """Conduct one iteration of learning."""
        raise NotImplementedError("You need to override the "
                                  "Trainer.train() function.")
# Run this cell without modification
def run(trainer_cls, config=None, reward_threshold=None):
    """Run the trainer and report progress, agnostic to the class of trainer.

    :param trainer_cls: a trainer class
    :param config: a dict
    :param reward_threshold: the reward threshold at which to stop training
    :return: the trained trainer and a list of dicts containing learning progress
    """
    if config is None:
        config = {}
    trainer = trainer_cls(config)
    config = trainer.config
    start = now = time.time()
    stats = []
    total_steps = 0
    try:
        for i in range(config['max_iteration'] + 1):
            stat = trainer.train()
            stat = stat or {}
            stats.append(stat)
            if "episode_len" in stat:
                total_steps += stat["episode_len"]
            if i % config['evaluate_interval'] == 0 or \
                    i == config["max_iteration"]:
                reward, _ = trainer.evaluate(
                    config.get("evaluate_num_episodes", 50),
                    max_episode_length=config.get("max_episode_length", 1000)
                )
                print("({:.1f}s,+{:.1f}s) Iter {}, {}episodic return"
                      " is {:.2f}. {}".format(
                          time.time() - start,
                          time.time() - now,
                          i,
                          "" if total_steps == 0 else "Step {}, ".format(total_steps),
                          reward,
                          {k: round(np.mean(v), 4) for k, v in stat.items()
                           if not np.isnan(v) and k != "frames"}
                          if stat else ""))
                now = time.time()
                if reward_threshold is not None and reward > reward_threshold:
                    print("In {} iteration, episodic return {:.3f} is "
                          "greater than reward threshold {}. Congratulation! Now we "
                          "exit the training process.".format(
                              i, reward, reward_threshold))
                    break
    except Exception as e:
        print("Error happens during training: ")
        raise e
    finally:
        if hasattr(trainer.env, "close"):
            trainer.env.close()
            print("Environment is closed.")
    return trainer, stats
(20/100 points)
Q-learning is an off-policy algorithm that differs from SARSA in how it computes the TD error.
Instead of running the policy to obtain the next action $a'$ and computing:
$r + \gamma Q(s', a') - Q(s, a)$
as in SARSA, in Q-learning we compute the TD error via:
$r + \gamma \max_{a'} Q(s', a') - Q(s, a)$.
The reason we call it "off-policy" is that the next-Q value is not computed for the "behavior policy"; instead, it is computed for a "virtual policy" that always takes the best action given the current Q values.
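The contrast between the two TD errors can be seen in a small numeric sketch (the Q-table values and transition below are made up purely for illustration):

```python
import numpy as np

# Hypothetical tiny Q-table: 2 states x 2 actions.
Q = np.array([[0.0, 1.0],
              [2.0, 0.5]])
gamma, s, a, r, s_next = 0.9, 0, 1, 1.0, 1

# SARSA (on-policy): bootstrap with the action a' the behavior policy actually takes next.
a_next = 1  # suppose the epsilon-greedy policy happened to pick action 1
td_error_sarsa = r + gamma * Q[s_next, a_next] - Q[s, a]

# Q-learning (off-policy): bootstrap with the greedy action, regardless of what we do next.
td_error_q = r + gamma * Q[s_next].max() - Q[s, a]

assert np.isclose(td_error_sarsa, 0.45)  # 1 + 0.9 * 0.5 - 1
assert np.isclose(td_error_q, 1.8)       # 1 + 0.9 * 2.0 - 1
```

Note that the two errors coincide only when the behavior policy happens to pick the greedy action.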
# Solve the TODOs and remove `pass`
# Managing configurations of your experiments is important for your research.
Q_LEARNING_TRAINER_CONFIG = merge_config(dict(
    eps=0.3,
), DEFAULT_CONFIG)
class QLearningTrainer(AbstractTrainer):
    def __init__(self, config=None):
        config = merge_config(config, Q_LEARNING_TRAINER_CONFIG)
        super(QLearningTrainer, self).__init__(config=config)
        self.gamma = self.config["gamma"]
        self.eps = self.config["eps"]
        self.max_episode_length = self.config["max_episode_length"]
        self.learning_rate = self.config["learning_rate"]
        # Build the Q table
        self.table = np.zeros((self.obs_dim, self.act_dim))

    def compute_action(self, obs, eps=None):
        """Implement the epsilon-greedy policy.

        This function takes an integer (state / observation)
        as input and returns an integer (action).
        """
        if eps is None:
            eps = self.eps
        # [TODO] You need to implement the epsilon-greedy policy here.
        # With probability 1 - epsilon: act greedily.
        if np.random.random() > eps:
            action = np.argmax(self.table[obs])
        else:
            action = self.env.action_space.sample()
        return action

    def train(self):
        """Conduct one iteration of learning."""
        # [TODO] The Q table may need to be reset to zeros.
        # If you think it should be, then do it. If not, move on.
        obs = self.env.reset()
        for t in range(self.max_episode_length):
            act = self.compute_action(obs)
            next_obs, reward, done, _ = self.env.step(act)
            # [TODO] Compute the TD error based on the next observation and the current reward
            td_error = reward + self.gamma * np.max(self.table[next_obs]) - self.table[obs][act]
            # [TODO] Compute the new Q value
            # Hint: use the TD error, self.learning_rate and the current Q value
            new_value = self.table[obs][act] + self.learning_rate * td_error
            self.table[obs][act] = new_value
            obs = next_obs
            if done:
                break
# Run this cell without modification
q_learning_trainer, _ = run(
    trainer_cls=QLearningTrainer,
    config=dict(
        max_iteration=5000,
        evaluate_interval=50,
        evaluate_num_episodes=50,
        env_name='FrozenLakeNotSlippery-v1'
    ),
    reward_threshold=0.99
)
(0.2s,+0.2s) Iter 0, episodic return is 0.00. (0.3s,+0.2s) Iter 50, episodic return is 0.00. (0.5s,+0.2s) Iter 100, episodic return is 0.00. (0.7s,+0.2s) Iter 150, episodic return is 0.00. (0.8s,+0.2s) Iter 200, episodic return is 0.00. (1.0s,+0.1s) Iter 250, episodic return is 0.00. (1.1s,+0.2s) Iter 300, episodic return is 0.00. (1.3s,+0.2s) Iter 350, episodic return is 0.00. (1.5s,+0.2s) Iter 400, episodic return is 0.00. (1.7s,+0.2s) Iter 450, episodic return is 0.00. (1.9s,+0.2s) Iter 500, episodic return is 0.00. (2.1s,+0.2s) Iter 550, episodic return is 0.00. (2.2s,+0.2s) Iter 600, episodic return is 0.00. (2.4s,+0.1s) Iter 650, episodic return is 0.00. (2.6s,+0.2s) Iter 700, episodic return is 0.00. (2.7s,+0.2s) Iter 750, episodic return is 0.00. (2.9s,+0.2s) Iter 800, episodic return is 0.00. (3.1s,+0.2s) Iter 850, episodic return is 0.00. (3.2s,+0.2s) Iter 900, episodic return is 0.00. (3.4s,+0.2s) Iter 950, episodic return is 0.00. (3.6s,+0.2s) Iter 1000, episodic return is 0.00. (3.8s,+0.2s) Iter 1050, episodic return is 0.00. (4.0s,+0.2s) Iter 1100, episodic return is 0.00. (4.2s,+0.2s) Iter 1150, episodic return is 0.00. (4.3s,+0.2s) Iter 1200, episodic return is 0.00. (4.5s,+0.2s) Iter 1250, episodic return is 0.00. (4.7s,+0.2s) Iter 1300, episodic return is 0.00. (4.9s,+0.2s) Iter 1350, episodic return is 0.00. (5.0s,+0.2s) Iter 1400, episodic return is 0.00. (5.2s,+0.2s) Iter 1450, episodic return is 0.00. (5.4s,+0.2s) Iter 1500, episodic return is 0.00. (5.5s,+0.2s) Iter 1550, episodic return is 0.00. (5.7s,+0.2s) Iter 1600, episodic return is 0.00. (5.9s,+0.2s) Iter 1650, episodic return is 0.00. (6.0s,+0.2s) Iter 1700, episodic return is 0.00. (6.2s,+0.2s) Iter 1750, episodic return is 0.00. (6.4s,+0.2s) Iter 1800, episodic return is 0.00. (6.5s,+0.2s) Iter 1850, episodic return is 0.00. (6.7s,+0.2s) Iter 1900, episodic return is 0.00. (6.9s,+0.2s) Iter 1950, episodic return is 0.00. (7.0s,+0.1s) Iter 2000, episodic return is 0.00. 
(7.2s,+0.2s) Iter 2050, episodic return is 0.00. (7.4s,+0.2s) Iter 2100, episodic return is 0.00. (7.5s,+0.2s) Iter 2150, episodic return is 0.00. (7.7s,+0.2s) Iter 2200, episodic return is 0.00. (7.9s,+0.2s) Iter 2250, episodic return is 0.00. (8.1s,+0.2s) Iter 2300, episodic return is 0.00. (8.2s,+0.2s) Iter 2350, episodic return is 0.00. (8.4s,+0.2s) Iter 2400, episodic return is 0.00. (8.6s,+0.2s) Iter 2450, episodic return is 0.00. (8.7s,+0.2s) Iter 2500, episodic return is 0.00. (8.9s,+0.2s) Iter 2550, episodic return is 0.00. (9.1s,+0.2s) Iter 2600, episodic return is 0.00. (9.2s,+0.2s) Iter 2650, episodic return is 0.00. (9.4s,+0.2s) Iter 2700, episodic return is 0.00. (9.6s,+0.2s) Iter 2750, episodic return is 0.00. (9.8s,+0.2s) Iter 2800, episodic return is 0.00. (9.9s,+0.2s) Iter 2850, episodic return is 0.00. (10.1s,+0.2s) Iter 2900, episodic return is 0.00. (10.3s,+0.2s) Iter 2950, episodic return is 0.00. (10.4s,+0.2s) Iter 3000, episodic return is 0.00. (10.6s,+0.2s) Iter 3050, episodic return is 0.00. (10.7s,+0.2s) Iter 3100, episodic return is 0.00. (10.9s,+0.2s) Iter 3150, episodic return is 0.00. (11.1s,+0.2s) Iter 3200, episodic return is 0.00. (11.3s,+0.2s) Iter 3250, episodic return is 0.00. (11.4s,+0.2s) Iter 3300, episodic return is 0.00. (11.6s,+0.2s) Iter 3350, episodic return is 0.00. (11.8s,+0.2s) Iter 3400, episodic return is 0.00. (11.9s,+0.2s) Iter 3450, episodic return is 0.00. (12.1s,+0.2s) Iter 3500, episodic return is 0.00. (12.3s,+0.2s) Iter 3550, episodic return is 0.00. (12.5s,+0.2s) Iter 3600, episodic return is 0.00. (12.6s,+0.2s) Iter 3650, episodic return is 0.00. (12.7s,+0.0s) Iter 3700, episodic return is 1.00. In 3700 iteration, episodic return 1.000 is greater than reward threshold 0.99. Congratulation! Now we exit the training process. Environment is closed.
# Run this cell without modification
# Render the learned behavior
_ = evaluate(
    policy=q_learning_trainer.policy,
    num_episodes=1,
    env_name=q_learning_trainer.env_name,
    render="human",  # Visualize the behavior here in the cell
    sleep=0.5  # The time interval between two rendered frames
)
(Right)
SFFF
FHFH
FFFH
HFFG
(30 / 100 points)
In this section, we will implement a basic neural network and Deep Q-Learning with PyTorch, a powerful deep learning framework.
Before starting, make sure you have installed it via pip install torch (see Section 0).
If you are not familiar with PyTorch, we suggest going through the official PyTorch quickstart tutorials:
Different from the Q learning in Section 2, we will implement Deep Q Network (DQN) in this section. The main differences are summarized as follows:
DQN requires an experience replay memory to store transitions. A replay memory is implemented in the following ExperienceReplayMemory class. It stores up to a fixed number of transitions $(s_t, a_t, r_t, s_{t+1}, \text{done}_t)$. When the memory is full, the earliest transition is discarded to make room for the latest one.
The introduction of replay memory improves sample efficiency (since each transition may be used multiple times) when solving complex tasks. However, you may find that it learns slowly in this assignment, since CartPole-v0 is a relatively easy environment.
DQN has a delayed-update target network. DQN maintains another neural network, called the target network, whose structure is identical to that of the Q network. After a certain number of steps, the target network copies the parameters of the Q network. Normally, the target network is updated much less frequently than the Q network, since the Q network is updated at every step.
The reason to leverage the target network is to stabilize the estimation of the TD error. In DQN, the TD error is evaluated as:
$$(r_t + \gamma \max_{a_{t+1}} Q^{target}(s_{t+1}, a_{t+1}) - Q(s_t, a_t))$$
The Q value of the next state is estimated by the target network, not the Q network that is being updated. This mechanism reduces the variance of the gradient because the next Q values are not influenced by updates to the current Q network.
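Putting the target-network idea together, here is a minimal PyTorch sketch of the TD-target computation. It is an illustration only, not the assignment's exact code: the single-linear-layer "networks", batch size 32, and the 4-dimensional observations are assumptions for the sake of a small example.

```python
import torch
import torch.nn as nn

# Hypothetical tiny Q networks: 4-dim observations, 2 actions (CartPole-like).
q_net = nn.Linear(4, 2)
target_net = nn.Linear(4, 2)
target_net.load_state_dict(q_net.state_dict())  # the periodic hard copy

# A fake sampled batch of 32 transitions.
obs = torch.randn(32, 4)
act = torch.randint(0, 2, (32,))
rew = torch.randn(32)
next_obs = torch.randn(32, 4)
done = torch.zeros(32)
gamma = 0.99

# The TD target uses the *frozen* target network; no gradients flow through it.
with torch.no_grad():
    next_q = target_net(next_obs).max(dim=1).values
    target = rew + gamma * (1.0 - done) * next_q

# Q(s_t, a_t) from the online network, then the MSE TD loss.
q_sa = q_net(obs).gather(1, act.unsqueeze(1)).squeeze(1)
loss = nn.MSELoss()(q_sa, target)
loss.backward()  # gradients reach q_net only, not target_net
```

Because the target is computed under torch.no_grad(), updating q_net never moves the bootstrap target within a single step, which is exactly the stabilization described above.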
# Solve the TODOs and remove `pass`
from collections import deque
import random
class ExperienceReplayMemory:
    """Store and sample transitions."""

    def __init__(self, capacity):
        # deque is a useful class that acts like a list but holds at most
        # `maxlen` elements. When appending a new element would exceed
        # `maxlen`, the oldest element (the one at index 0) is removed.
        # [TODO] uncomment next line.
        self.memory = deque(maxlen=capacity)

    def push(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
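The eviction behavior of the underlying deque can be seen in a tiny standalone example (the capacity of 3 and the string transitions are arbitrary):

```python
from collections import deque
import random

memory = deque(maxlen=3)  # same mechanism as ExperienceReplayMemory(capacity=3)
for t in range(5):
    memory.append(("obs%d" % t, t))  # push 5 transitions into a size-3 buffer

# Only the 3 most recent transitions survive; the oldest two were evicted.
assert list(memory) == [("obs2", 2), ("obs3", 3), ("obs4", 4)]

# Sampling draws a uniform random mini-batch without replacement.
batch = random.sample(memory, 2)
assert len(batch) == 2
```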
# Solve the TODOs and remove `pass`
class PytorchModel(nn.Module):
    def __init__(self, num_inputs, num_actions, hidden_units=100):
        super(PytorchModel, self).__init__()
        print("Num inputs: {}, Num actions: {}".format(num_inputs, num_actions))
        # [TODO] Build an nn.Sequential object as the neural network with two hidden layers.
        # The first hidden layer has `hidden_units` hidden units, followed by
        # a ReLU activation function.
        # The second hidden layer takes a `hidden_units`-dimensional vector as input
        # and outputs another `hidden_units`-dimensional vector, followed by a ReLU activation.
        # The output layer takes the activation vector from the second hidden layer,
        # which has `hidden_units` elements, as input and returns `num_actions` values.
        self.action_value = nn.Sequential(
            nn.Linear(num_inputs, hidden_units),
            nn.ReLU(),
            nn.Linear(hidden_units, hidden_units),
            nn.ReLU(),
            nn.Linear(hidden_units, num_actions)
        )

    def forward(self, obs):
        return self.action_value(obs)
# Test
test_pytorch_model = PytorchModel(num_inputs=3, num_actions=7, hidden_units=123)
assert isinstance(test_pytorch_model.action_value, nn.Module)
assert len(test_pytorch_model.state_dict()) == 6
assert test_pytorch_model.state_dict()["action_value.0.weight"].shape == (123, 3)
print("Name of each parameter vectors: ", test_pytorch_model.state_dict().keys())
print("Test passed!")
Num inputs: 3, Num actions: 7 Name of each parameter vectors: odict_keys(['action_value.0.weight', 'action_value.0.bias', 'action_value.2.weight', 'action_value.2.bias', 'action_value.4.weight', 'action_value.4.bias']) Test passed!
# Solve the TODOs and remove `pass`
DQN_CONFIG = merge_config(dict(
    parameter_std=0.01,
    learning_rate=0.01,
    hidden_dim=100,
    clip_norm=1.0,
    clip_gradient=True,
    max_iteration=1000,
    max_episode_length=1000,
    evaluate_interval=100,
    gamma=0.99,
    eps=0.3,
    memory_size=50000,
    learn_start=5000,
    batch_size=32,
    target_update_freq=500,  # in steps
    learn_freq=1,  # in steps
    n=1,
    env_name="CartPole-v0",
), Q_LEARNING_TRAINER_CONFIG)
def to_tensor(x):
"""A helper function to transform a NumPy array into a PyTorch tensor"""
if isinstance(x, np.ndarray):
x = torch.from_numpy(x).type(torch.float32)
assert isinstance(x, torch.Tensor)
if x.dim() == 3 or x.dim() == 1:
x = x.unsqueeze(0)
assert x.dim() == 2 or x.dim() == 4, x.shape
return x
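As a quick illustration of the shape handling in `to_tensor`: a single observation (1-D) or single image (3-D) gains a leading batch axis, while already-batched inputs pass through unchanged. The NumPy mirror below is for exposition only (the trainer itself uses the torch version above):

```python
import numpy as np

def add_batch_dim(x):
    # NumPy mirror of to_tensor's shape logic (illustration only):
    # a single observation (1-D) or single image (3-D) gains a leading
    # batch axis; already-batched inputs (2-D or 4-D) pass through.
    if x.ndim in (1, 3):
        x = np.expand_dims(x, axis=0)
    assert x.ndim in (2, 4), x.shape
    return x

print(add_batch_dim(np.zeros(4)).shape)        # (1, 4): one CartPole obs
print(add_batch_dim(np.zeros((32, 4))).shape)  # (32, 4): already a batch
```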
class DQNTrainer(AbstractTrainer):
def __init__(self, config):
config = merge_config(config, DQN_CONFIG)
self.learning_rate = config["learning_rate"]
super().__init__(config)
self.memory = ExperienceReplayMemory(config["memory_size"])
self.learn_start = config["learn_start"]
self.batch_size = config["batch_size"]
self.target_update_freq = config["target_update_freq"]
self.clip_norm = config["clip_norm"]
self.hidden_dim = config["hidden_dim"]
self.max_episode_length = self.config["max_episode_length"]
self.learning_rate = self.config["learning_rate"]
self.gamma = self.config["gamma"]
self.n = self.config["n"]
self.step_since_update = 0
self.total_step = 0
# You need to setup the parameter for your function approximator.
self.initialize_parameters()
def initialize_parameters(self):
self.network = None
print("Setting up self.network with obs dim: {} and action dim: {}".format(self.obs_dim, self.act_dim))
self.network = PytorchModel(self.obs_dim, self.act_dim)
self.network.eval()
self.network.share_memory()
# [TODO] Uncomment next few lines
# Initialize the target network, which is identical to self.network
# and should have the same weights as self.network. So you should
# copy the weights of self.network into self.target_network.
self.target_network = PytorchModel(self.obs_dim, self.act_dim)
self.target_network.load_state_dict(self.network.state_dict())
self.target_network.eval()
# Build Adam optimizer and MSE Loss.
# [TODO] Uncomment next few lines
self.optimizer = torch.optim.Adam(
self.network.parameters(), lr=self.learning_rate
)
self.loss = nn.MSELoss()
def process_state(self, state):
"""Preprocess the state (observation).
If the environment provides discrete observation (state), transform
it to one-hot form. For example, the environment FrozenLake-v0
provides an integer in [0, ..., 15] denoting the 16 possible states.
We transform it to one-hot style:
original state 0 -> one-hot vector [1, 0, 0, 0, 0, 0, 0, 0, ...]
original state 1 -> one-hot vector [0, 1, 0, 0, 0, 0, 0, 0, ...]
original state 15 -> one-hot vector [0, ..., 0, 0, 0, 0, 0, 1]
If the observation space is continuous, then you should do nothing.
"""
if not self.discrete_obs:
return state
else:
new_state = np.zeros((self.obs_dim,))
new_state[state] = 1
return new_state
def compute_values(self, processed_state):
"""Compute the value for each potential action. Note that you
should NOT preprocess the state here."""
values = self.network(processed_state).detach().numpy()
return values
def compute_action(self, processed_state, eps=None):
"""Compute the action given the state. Note that the input
is the processed state."""
values = self.compute_values(processed_state)
assert values.ndim == 1, values.shape
if eps is None:
eps = self.eps
if np.random.uniform(0, 1) < eps:
action = self.env.action_space.sample()
else:
action = np.argmax(values)
return action
def train(self):
s = self.env.reset()
processed_s = self.process_state(s)
act = self.compute_action(processed_s)
stat = {"loss": [], "success_rate": np.nan}
for t in range(self.max_episode_length):
next_state, reward, done, info = self.env.step(act)
next_processed_s = self.process_state(next_state)
# Push the transition into memory.
self.memory.push(
(processed_s, act, reward, next_processed_s, done)
)
processed_s = next_processed_s
act = self.compute_action(next_processed_s)
self.step_since_update += 1
self.total_step += 1
if done:
# print("INFO: ", info)
if "arrive_dest" in info:
stat["success_rate"] = info["arrive_dest"]
break
if t % self.config["learn_freq"] != 0:
# It's not necessary to update in each step.
continue
if len(self.memory) < self.learn_start:
continue
elif len(self.memory) == self.learn_start:
print("Current memory contains {} transitions, "
"start learning!".format(self.learn_start))
batch = self.memory.sample(self.batch_size)
# Transform a batch of state / action / .. into a tensor.
state_batch = to_tensor(
np.stack([transition[0] for transition in batch])
)
action_batch = to_tensor(
np.stack([transition[1] for transition in batch])
)
reward_batch = to_tensor(
np.stack([transition[2] for transition in batch])
)
next_state_batch = torch.stack(
[transition[3] for transition in batch]
)
done_batch = to_tensor(
np.stack([transition[4] for transition in batch])
)
with torch.no_grad():
# [TODO] Compute the Q values of next states.
# Note: to_tensor gives done_batch shape (1, batch_size), so we index
# [0] to recover a (batch_size,) vector; terminal states contribute 0.
Q_t_plus_one = (1 - done_batch[0]) * self.target_network(next_state_batch).max(dim=1)[0]
assert isinstance(Q_t_plus_one, torch.Tensor)
assert Q_t_plus_one.dim() == 1
# [TODO] Compute the target value of Q
Q_target = (reward_batch[0] + self.gamma * Q_t_plus_one).float()
assert Q_target.shape == (self.batch_size,)
# Collect the Q values in batch.
self.network.train()
q_out = self.network(state_batch)
assert q_out.dim() == 2
Q_t = q_out.gather(1, action_batch.long().view(-1, 1)).squeeze(-1)
assert Q_t.shape == Q_target.shape
# Update the network
self.optimizer.zero_grad()
loss = self.loss(input=Q_t, target=Q_target)
loss_value = loss.item()
stat['loss'].append(loss_value)
loss.backward()
# [TODO] Gradient clipping. Uncomment next line
nn.utils.clip_grad_norm_(self.network.parameters(), self.clip_norm)
self.optimizer.step()
self.network.eval()
if len(self.memory) >= self.learn_start and \
self.step_since_update > self.target_update_freq:
print("{} steps have passed since last update. Now update the"
" parameters of the target network. Current step: {}".format(
self.step_since_update, self.total_step
))
self.step_since_update = 0
# [TODO] Copy the weights of self.network to self.target_network.
self.target_network.load_state_dict(self.network.state_dict())
self.target_network.eval()
ret = {"loss": np.mean(stat["loss"]) if stat["loss"] else np.nan, "episode_len": t}
if "success_rate" in stat:
ret["success_rate"] = stat["success_rate"]
return ret
def process_state(self, state):
"""Override the one-hot version above: observations in these tasks are continuous, so simply convert them to a float tensor."""
return torch.from_numpy(state).type(torch.float32)
def save(self, loc="model.pt"):
torch.save(self.network.state_dict(), loc)
def load(self, loc="model.pt"):
self.network.load_state_dict(torch.load(loc))
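The heart of `train` above is the one-step TD target $r + \gamma (1 - \text{done}) \max_a Q_{\text{target}}(s', a)$. A minimal NumPy sketch with made-up numbers (illustration only, not the trainer's torch code; the reward and Q values are hypothetical):

```python
import numpy as np

gamma = 0.99
# Hypothetical batch of 3 transitions: rewards, done flags, and the
# target network's Q values for the next states (3 actions each).
reward = np.array([1.0, 1.0, 1.0])
done = np.array([0.0, 0.0, 1.0])        # last transition ends the episode
next_q = np.array([[0.5, 2.0, 1.0],
                   [3.0, 0.0, 1.0],
                   [9.0, 9.0, 9.0]])    # last row is ignored: episode done

# Bootstrap from the best next-state action, but not across episode ends.
q_t_plus_one = (1.0 - done) * next_q.max(axis=1)
q_target = reward + gamma * q_t_plus_one
print(q_target)  # targets: 1 + 0.99*2, 1 + 0.99*3, and 1 (no bootstrap)
```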
# Run this cell without modification
# Build the test trainer.
test_trainer = DQNTrainer({})
# Test compute_values
fake_state = test_trainer.env.observation_space.sample()
processed_state = test_trainer.process_state(fake_state)
assert processed_state.shape == (test_trainer.obs_dim, ), processed_state.shape
values = test_trainer.compute_values(processed_state)
assert values.shape == (test_trainer.act_dim, ), values.shape
test_trainer.train()
print("Now your codes should be bug-free.")
_ = run(DQNTrainer, dict(
max_iteration=20,
evaluate_interval=10,
learn_start=100,
env_name="CartPole-v0",
))
test_trainer.save("test_trainer.pt")
test_trainer.load("test_trainer.pt")
print("Test passed!")
Setting up self.network with obs dim: 4 and action dim: 2
Num inputs: 4, Num actions: 2
Num inputs: 4, Num actions: 2
Now your codes should be bug-free.
Setting up self.network with obs dim: 4 and action dim: 2
Num inputs: 4, Num actions: 2
Num inputs: 4, Num actions: 2
(0.0s,+0.0s) Iter 0, Step 9, episodic return is 9.20. {'episode_len': 9.0}
Current memory contains 100 transitions, start learning!
(0.1s,+0.1s) Iter 10, Step 118, episodic return is 9.20. {'loss': 0.0139, 'episode_len': 11.0}
/Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3441: RuntimeWarning: Mean of empty slice. out=out, **kwargs) /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars ret = ret.dtype.type(ret / rcount)
(0.5s,+0.3s) Iter 20, Step 262, episodic return is 9.20. {'loss': 0.0017, 'episode_len': 8.0}
Environment is closed.
Test passed!
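`compute_action` above implements the standard ε-greedy rule. A standalone sketch of the same rule (plain NumPy, hypothetical Q values, using a seeded generator so the counts are reproducible):

```python
import numpy as np

def epsilon_greedy(values, eps, rng):
    # With probability eps explore uniformly, otherwise exploit the argmax.
    if rng.uniform(0.0, 1.0) < eps:
        return int(rng.integers(len(values)))
    return int(np.argmax(values))

rng = np.random.default_rng(0)
values = np.array([0.1, 0.9, 0.3])  # hypothetical Q values for 3 actions
actions = [epsilon_greedy(values, eps=0.1, rng=rng) for _ in range(1000)]
# The greedy action 1 dominates; exploration keeps the others non-zero.
print(np.bincount(actions, minlength=3))
```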
# Run this cell without modification
pytorch_trainer, pytorch_stat = run(DQNTrainer, dict(
max_iteration=2000,
evaluate_interval=50,
learning_rate=0.01,
clip_norm=10.0,
memory_size=50000,
learn_start=1000,
eps=0.1,
target_update_freq=1000,
batch_size=32,
env_name="CartPole-v0",
), reward_threshold=195.0)
reward, _ = pytorch_trainer.evaluate()
assert reward > 195.0, "Check your codes. " \
"Your agent should achieve {} reward in 1000 iterations, " \
"but it achieved {} reward in evaluation.".format(195.0, reward)
pytorch_trainer.save("dqn_trainer_cartpole.pt")
# Should solve the task in 10 minutes
Setting up self.network with obs dim: 4 and action dim: 2
Num inputs: 4, Num actions: 2
Num inputs: 4, Num actions: 2
(0.0s,+0.0s) Iter 0, Step 9, episodic return is 9.20. {'episode_len': 9.0}
(0.1s,+0.1s) Iter 50, Step 437, episodic return is 9.20. {'episode_len': 9.0}
(0.2s,+0.1s) Iter 100, Step 876, episodic return is 9.20. {'episode_len': 9.0}
Current memory contains 1000 transitions, start learning!
1006 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 1006
(1.2s,+1.0s) Iter 150, Step 1345, episodic return is 9.20. {'loss': 0.0836, 'episode_len': 9.0}
(2.3s,+1.1s) Iter 200, Step 1792, episodic return is 9.40. {'loss': 0.0959, 'episode_len': 11.0}
1005 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 2011
(3.4s,+1.1s) Iter 250, Step 2248, episodic return is 9.90. {'loss': 0.0351, 'episode_len': 7.0}
(4.5s,+1.0s) Iter 300, Step 2707, episodic return is 9.80. {'loss': 0.0652, 'episode_len': 9.0}
1011 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 3022
(5.6s,+1.2s) Iter 350, Step 3207, episodic return is 10.90. {'loss': 0.0203, 'episode_len': 9.0}
1011 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 4033
(7.2s,+1.6s) Iter 400, Step 3762, episodic return is 26.50. {'loss': 0.033, 'episode_len': 13.0}
1001 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 5034
(10.7s,+3.4s) Iter 450, Step 5057, episodic return is 57.00. {'loss': 0.0518, 'episode_len': 29.0}
1032 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 6066
1083 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 7149
1127 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 8276
1039 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 9315
(20.6s,+9.9s) Iter 500, Step 9452, episodic return is 199.10. {'loss': 0.1649, 'episode_len': 199.0}
In 500 iteration, episodic return 199.100 is greater than reward threshold 195.0. Congratulation! Now we exit the training process.
Environment is closed.
# Run this cell without modification
import matplotlib.pyplot as plt
%matplotlib inline
# Render the learned behavior
eval_reward, eval_info = evaluate(
policy=pytorch_trainer.policy,
num_episodes=1,
env_name=pytorch_trainer.env_name,
render="rgb_array", # Visualize the behavior here in the cell
)
animate(eval_info["frames"])
print("DQN agent achieves {} return.".format(eval_reward))
DQN agent achieves 200.0 return.
# Run this cell without modification
def register_metadrive():
from gym.envs.registration import register
from gym import Wrapper
try:
from metadrive.envs import MetaDriveEnv
from metadrive.utils.config import merge_config_with_unknown_keys
except ImportError as e:
print("Please install MetaDrive through: pip install git+https://github.com/decisionforce/metadrive")
raise e
env_names = []
try:
class MetaDriveEnvD(Wrapper):
def __init__(self, config, *args, **kwargs):
super().__init__(MetaDriveEnv(config))
self.action_space = gym.spaces.Discrete(int(np.prod(self.env.action_space.nvec)))
_make_env = lambda config=None: MetaDriveEnvD(config)
env_name = "MetaDrive-Tut-Easy-v0"
register(id=env_name, entry_point=_make_env, kwargs={"config": dict(
map="S",
start_seed=0,
environment_num=1,
horizon=200,
discrete_action=True,
discrete_steering_dim=3,
discrete_throttle_dim=3
)})
env_names.append(env_name)
env_name = "MetaDrive-Tut-Hard-v0"
register(id=env_name, entry_point=_make_env, kwargs={"config": dict(
map="CCC",
start_seed=0,
environment_num=10,
discrete_action=True,
discrete_steering_dim=5,
discrete_throttle_dim=5
)})
env_names.append(env_name)
except gym.error.Error as e:
print("Information when registering MetaDrive: ", e)
else:
print("Successfully registered MetaDrive environments: ", env_names)
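`MetaDriveEnvD` above flattens the simulator's MultiDiscrete action space (steering × throttle) into a single `Discrete(prod(nvec))`. Assuming a row-major layout (an assumption of this sketch, not something the MetaDrive internals guarantee), a flat index maps back to a (steering, throttle) index pair like so:

```python
import numpy as np

nvec = (3, 3)  # discrete_steering_dim=3, discrete_throttle_dim=3
print(int(np.prod(nvec)))  # 9: size of the flattened Discrete space

# Hypothetical decoding under row-major order via np.unravel_index.
decoded = {a: tuple(int(i) for i in np.unravel_index(a, nvec))
           for a in range(int(np.prod(nvec)))}
print(decoded[0], decoded[4], decoded[8])  # (0, 0) (1, 1) (2, 2)
```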
# Run this cell without modification
register_metadrive()
Successfully registered the following environments: ['MetaDrive-validation-v0', 'MetaDrive-10env-v0', 'MetaDrive-100envs-v0', 'MetaDrive-1000envs-v0', 'SafeMetaDrive-validation-v0', 'SafeMetaDrive-10env-v0', 'SafeMetaDrive-100envs-v0', 'SafeMetaDrive-1000envs-v0', 'MARLTollgate-v0', 'MARLBottleneck-v0', 'MARLRoundabout-v0', 'MARLIntersection-v0', 'MARLParkingLot-v0', 'MARLMetaDrive-v0'].
Successfully registered MetaDrive environments: ['MetaDrive-Tut-Easy-v0', 'MetaDrive-Tut-Hard-v0']
# Run this cell without modification
# Build the test trainer.
test_trainer = DQNTrainer(dict(env_name="MetaDrive-Tut-Easy-v0"))
# Test compute_values
for _ in range(10):
fake_state = test_trainer.env.observation_space.sample()
processed_state = test_trainer.process_state(fake_state)
assert processed_state.shape == (test_trainer.obs_dim, ), processed_state.shape
values = test_trainer.compute_values(processed_state)
assert values.shape == (test_trainer.act_dim, ), values.shape
test_trainer.train()
print("Now your codes should be bug-free.")
test_trainer.env.close()
del test_trainer
WARNING:root:BaseEngine is not launched, fail to sync seed to engine!
Setting up self.network with obs dim: 259 and action dim: 9
Num inputs: 259, Num actions: 9
Num inputs: 259, Num actions: 9
Now your codes should be bug-free.
# Run this cell without modification
env_name = "MetaDrive-Tut-Easy-v0"
pytorch_trainer2, _ = run(DQNTrainer, dict(
max_episode_length=200,
max_iteration=5000,
evaluate_interval=10,
evaluate_num_episodes=10,
learning_rate=0.0001,
clip_norm=10.0,
memory_size=1000000,
learn_start=2000,
eps=0.1,
target_update_freq=5000,
learn_freq=16,
batch_size=256,
env_name=env_name
), reward_threshold=120)
pytorch_trainer2.save("dqn_trainer_metadrive_easy.pt")
WARNING:root:BaseEngine is not launched, fail to sync seed to engine! :task(warning): Creating implicit AsyncTaskChain default for AsyncTaskManager TaskManager
Setting up self.network with obs dim: 259 and action dim: 9
Num inputs: 259, Num actions: 9
Num inputs: 259, Num actions: 9
(3.0s,+3.0s) Iter 0, Step 199, episodic return is -0.57. {'episode_len': 199.0}
(8.1s,+5.0s) Iter 10, Step 2189, episodic return is -0.57. {'loss': 0.0072, 'episode_len': 199.0}
(9.9s,+1.8s) Iter 20, Step 3056, episodic return is -4.41. {'loss': 0.0545, 'episode_len': 20.0, 'success_rate': 0.0}
(14.3s,+4.4s) Iter 30, Step 4498, episodic return is 125.58. {'loss': 0.1553, 'episode_len': 48.0, 'success_rate': 0.0}
In 30 iteration, episodic return 125.581 is greater than reward threshold 120. Congratulation! Now we exit the training process.
Environment is closed.
# Run this cell without modification
# Render the learned behavior
# NOTE: The learned agent is marked by green color.
eval_reward, eval_info = evaluate(
policy=pytorch_trainer2.policy,
num_episodes=1,
env_name=pytorch_trainer2.env_name,
render="topdown", # Visualize the behaviors in top-down view
verbose=True
)
frames = [pygame.surfarray.array3d(f).swapaxes(0, 1) for f in eval_info["frames"]]
animate(frames)
print("DQN agent achieves {} return in MetaDrive easy environment.".format(eval_reward))
DQN agent achieves 125.58145966674864 return in MetaDrive easy environment.
We will train an agent to solve a hard MetaDrive environment with multiple curved road segments, and visualize its behavior later.
The training log of my experiment is left below for your reference. As you can see, the performance is not good: the success rate stays at zero.
GOAL: achieve episodic return > 50.
A BONUS can be earned if you improve the training performance by adjusting hyper-parameters and optimizing the code. Improvement means achieving a success rate > 0.0. However, we can't promise that the DQN algorithm is able to solve this task. Please create an independent markdown cell to highlight your improvement.
# Run this cell without modification
# (of course you can adjust hyper-parameters if you like)
# We might want to stop the training and restore later.
# Therefore, we don't use the `run` function but instead
# explicitly expose the trainer here.
# This avoids losing the trained agent if any unexpected error
# happens during training: you can stop at any time and then
# run the next cell to see the visualization.
# This also allows us to save and restore intermediate agents if we want.
metadrive_config = dict(
max_episode_length=1000,
max_iteration=5000,
evaluate_interval=50,
evaluate_num_episodes=5,
learning_rate=0.0001,
clip_norm=10.0,
memory_size=1000000,
learn_start=5000,
eps=0.2,
target_update_freq=5000,
learn_freq=16,
batch_size=256,
env_name="MetaDrive-Tut-Hard-v0"
)
metadrive_reward_threshold = 1000
metadrive_trainer = DQNTrainer(metadrive_config)
# We might want to load trained trainer to pick up training:
if os.path.isfile("dqn_trainer_metadrive_hard.pt"):
metadrive_trainer.load("dqn_trainer_metadrive_hard.pt")
metadrive_config = metadrive_trainer.config
start = now = time.time()
stats = []
total_steps = 0
try:
for i in range(metadrive_config['max_iteration'] + 1):
stat = metadrive_trainer.train()
stat = stat or {}
stats.append(stat)
metadrive_trainer.save("dqn_trainer_metadrive_hard.pt")
if "episode_len" in stat:
total_steps += stat["episode_len"]
if i % metadrive_config['evaluate_interval'] == 0 or \
i == metadrive_config["max_iteration"]:
reward, _ = metadrive_trainer.evaluate(
metadrive_config.get("evaluate_num_episodes", 50),
max_episode_length=metadrive_config.get("max_episode_length", 1000)
)
print("({:.1f}s,+{:.1f}s) Iter {}, {}episodic return"
" is {:.2f}. {}".format(
time.time() - start,
time.time() - now,
i,
"" if total_steps == 0 else "Step {}, ".format(total_steps),
reward,
{k: round(np.mean(v), 4) for k, v in stat.items()
if not np.isnan(v) and k != "frames"
}
if stat else ""
))
now = time.time()
if metadrive_reward_threshold is not None and reward > metadrive_reward_threshold:
print("In {} iteration, episodic return {:.3f} is "
"greater than reward threshold {}. Congratulation! Now we "
"exit the training process.".format(
i, reward, metadrive_reward_threshold))
break
except Exception as e:
print("Error happens during training: ")
raise e
finally:
if hasattr(metadrive_trainer.env, "close"):
metadrive_trainer.env.close()
print("Environment is closed.")
WARNING:root:BaseEngine is not launched, fail to sync seed to engine! :task(warning): Creating implicit AsyncTaskChain default for AsyncTaskManager TaskManager
Setting up self.network with obs dim: 259 and action dim: 25
Num inputs: 259, Num actions: 25
Num inputs: 259, Num actions: 25
(17.5s,+17.5s) Iter 0, Step 149, episodic return is 282.84. {'episode_len': 149.0, 'success_rate': 0.0}
5685 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 5685
6000 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 11685
5271 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 16956
6000 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 22956
5336 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 28292
5060 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 33352
5912 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 39264
(200.7s,+183.2s) Iter 50, Step 42328, episodic return is 303.20. {'loss': 0.2947, 'episode_len': 590.0, 'success_rate': 0.0}
5944 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 45208
5636 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 50844
5975 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 56819
5067 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 61886
5509 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 67395
5212 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 72607
(358.0s,+157.3s) Iter 100, Step 75461, episodic return is 334.06. {'loss': 0.5661, 'episode_len': 999.0}
5777 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 78384
5308 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 83692
5168 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 88860
5309 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 94169
5481 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 99650
(478.6s,+120.7s) Iter 150, Step 99499, episodic return is 181.72. {'loss': 0.9286, 'episode_len': 618.0, 'success_rate': 1.0}
5131 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 104781
5129 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 109910
5413 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 115323
(573.9s,+95.3s) Iter 200, Step 117285, episodic return is 152.99. {'loss': 0.7506, 'episode_len': 276.0, 'success_rate': 0.0}
5326 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 120649
5158 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 125807
5182 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 130989
(649.6s,+75.7s) Iter 250, Step 132059, episodic return is 98.61. {'loss': 1.2871, 'episode_len': 257.0, 'success_rate': 0.0}
5133 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 136122
5268 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 141390
5082 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 146472
(724.3s,+74.6s) Iter 300, Step 146325, episodic return is 98.24. {'loss': 1.17, 'episode_len': 100.0, 'success_rate': 0.0}
5052 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 151524
5063 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 156587
(803.6s,+79.3s) Iter 350, Step 160872, episodic return is 171.56. {'loss': 1.7305, 'episode_len': 357.0, 'success_rate': 0.0}
5179 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 161766
5282 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 167048
5003 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 172051
(875.4s,+71.8s) Iter 400, Step 174156, episodic return is 125.61. {'loss': 2.5448, 'episode_len': 198.0, 'success_rate': 0.0}
5043 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 177094
5344 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 182438
(938.7s,+63.4s) Iter 450, Step 185705, episodic return is 149.42. {'loss': 2.5494, 'episode_len': 490.0, 'success_rate': 0.0}
5277 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 187715
5092 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 192807
(996.9s,+58.2s) Iter 500, Step 196743, episodic return is 169.28. {'loss': 3.6407, 'episode_len': 82.0, 'success_rate': 0.0}
5132 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 197939
5002 steps has passed since last update. Now update the parameter of the behavior policy. Current step: 202941
Environment is closed.
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
/var/folders/qn/ktplt3rn673_xx4m99jn41hw0000gn/T/ipykernel_8227/3294512945.py in <module>
     42 try:
     43     for i in range(metadrive_config['max_iteration'] + 1):
---> 44         stat = metadrive_trainer.train()
     45         stat = stat or {}
     46         stats.append(stat)

/var/folders/qn/ktplt3rn673_xx4m99jn41hw0000gn/T/ipykernel_8227/2480214744.py in train(self)
    133
    134         for t in range(self.max_episode_length):
--> 135             next_state, reward, done, info = self.env.step(act)
    136             next_processed_s = self.process_state(next_state)
    137

~/opt/anaconda3/envs/cs269/lib/python3.7/site-packages/gym/core.py in step(self, action)
    246
    247     def step(self, action):
--> 248         return self.env.step(action)
    249
    250     def reset(self, **kwargs):

~/opt/anaconda3/envs/cs269/lib/python3.7/site-packages/metadrive/envs/base_env.py in step(self, actions)
    244         self.episode_steps += 1
    245         actions = self._preprocess_actions(actions)
--> 246         engine_info = self._step_simulator(actions)
    247         o, r, d, i = self._get_step_return(actions, engine_info=engine_info)
    248         return o, r, d, i

~/opt/anaconda3/envs/cs269/lib/python3.7/site-packages/metadrive/envs/base_env.py in _step_simulator(self, actions)
    274         self.engine.step(self.config["decision_repeat"])
    275         # update states, if restore from episode data, position and heading will be force set in update_state() function
--> 276         scene_manager_after_step_infos = self.engine.after_step()
    277         return merge_dicts(
    278             scene_manager_after_step_infos, scene_manager_before_step_infos, allow_new_keys=True, without_copy=True

~/opt/anaconda3/envs/cs269/lib/python3.7/site-packages/metadrive/engine/base_engine.py in after_step(self)
    248         step_infos = {}
    249         for manager in self.managers.values():
--> 250             new_step_info = manager.after_step()
    251             step_infos = concat_step_infos([step_infos, new_step_info])
    252         self.interface.after_step()

~/opt/anaconda3/envs/cs269/lib/python3.7/site-packages/metadrive/manager/traffic_manager.py in after_step(self)
     94         v_to_remove = []
     95         for v in self._traffic_vehicles:
---> 96             v.after_step()
     97             if not v.on_lane:
     98                 v_to_remove.append(v)

~/opt/anaconda3/envs/cs269/lib/python3.7/site-packages/metadrive/component/vehicle/base_vehicle.py in after_step(self)
    252     def after_step(self):
    253         if self.navigation is not None:
--> 254             self.navigation.update_localization(self)
    255         self._state_check()
    256         self.update_dist_to_left_right()

~/opt/anaconda3/envs/cs269/lib/python3.7/site-packages/metadrive/component/vehicle_navigation_module/node_network_navigation.py in update_localization(self, ego_vehicle)
     99     def update_localization(self, ego_vehicle):
    100         position = ego_vehicle.position
--> 101         lane, lane_index = self._update_current_lane(ego_vehicle)
    102         long, _ = lane.local_coordinates(position)
    103         need_update = self._update_target_checkpoints(lane_index, long)

~/opt/anaconda3/envs/cs269/lib/python3.7/site-packages/metadrive/component/vehicle_navigation_module/node_network_navigation.py in _update_current_lane(self, ego_vehicle)
    261
    262     def _update_current_lane(self, ego_vehicle):
--> 263         lane, lane_index, on_lane = self._get_current_lane(ego_vehicle)
    264         ego_vehicle.on_lane = on_lane
    265         if lane is None:

~/opt/anaconda3/envs/cs269/lib/python3.7/site-packages/metadrive/component/vehicle_navigation_module/node_network_navigation.py in _get_current_lane(self, ego_vehicle)
    189         """
    190         possible_lanes, on_lane = ray_localization(
--> 191             ego_vehicle.heading, ego_vehicle.position, ego_vehicle.engine, return_all_result=True, return_on_lane=True
    192         )
    193         for lane, index, l_1_dist in possible_lanes:

~/opt/anaconda3/envs/cs269/lib/python3.7/site-packages/metadrive/utils/scene_utils.py in ray_localization(heading, position, engine, return_all_result, use_heading_filter, return_on_lane)
    181         # dot_result = dir.dot(heading)
    182
--> 183         dot_result = math.cos(lane_heading) * heading[0] + math.sin(lane_heading) * heading[1]
    184         cosangle = dot_result / (
    185             norm(math.cos(lane_heading), math.sin(lane_heading)) * norm(heading[0], heading[1])

KeyboardInterrupt:
# Run this cell without modification
# Render the learned behavior
# NOTE: The learned agent is marked by green color.
eval_reward, eval_info = evaluate(
policy=metadrive_trainer.policy,
num_episodes=1,
env_name=metadrive_trainer.env_name,
render="topdown", # Visualize the behaviors in top-down view
verbose=True
)
frames = [pygame.surfarray.array3d(f).swapaxes(0, 1) for f in eval_info["frames"]]
animate(frames)
print("DQN agent achieves {} return in MetaDrive hard environment.".format(eval_reward))
DQN agent achieves 470.16495154186305 return in MetaDrive hard environment.
(30 / 100 points)
Unlike supervised learning, in RL the optimization objective (the expected return) is not differentiable w.r.t. the neural network parameters. This can be worked around via *Policy Gradient*. It can be proved that the policy gradient is an unbiased estimator of the gradient of the objective.
Concretely, let's consider the following optimization objective:
$$Q = \mathbb E_{\text{possible trajectories}} \sum_t r(a_t, s_t) = \sum_{s_0, a_0, \ldots} p(s_0, a_0, ..., s_t, a_t) r(s_0, a_0, ..., s_t, a_t) = \sum_{\tau} p(\tau)r(\tau)$$wherein $\sum_t r(a_t, s_t) = r(\tau)$ is the return of trajectory $\tau = (s_0, a_0, ...)$. We remove the discount factor for simplicity. Since we want to maximize $Q$, we can simply compute the gradient of $Q$ w.r.t. the parameters $\theta$ (which are implicitly included in $p(\tau)$):
$$\nabla_\theta Q = \nabla_\theta \sum_{\tau} p(\tau)r(\tau) = \sum_{\tau} r(\tau) \nabla_\theta p(\tau)$$Apply a famous trick: $\nabla_\theta p(\tau) = p(\tau)\cfrac{\nabla_\theta p(\tau)}{p(\tau)} = p(\tau)\nabla_\theta \log p(\tau)$.
Introducing a log term changes the product of probabilities into a sum of log-probabilities. Now we can expand the log of the product above into a sum of logs:
$$p_\theta(\tau) = p(s_0, a_0, ...) = p(s_0) \prod_t \pi_\theta (a_t|s_t) p(s_{t+1}|s_t, a_t)$$$$\log p_\theta (\tau) = \log p(s_0) + \sum_t \log \pi_\theta(a_t|s_t) + \sum_t \log p(s_{t+1}|s_t, a_t)$$You can find that the first and third terms do not depend on the parameters of the policy $\pi_\theta(\cdot)$. So when we move back to $\nabla_\theta Q$, we find
$$\nabla_\theta Q = \sum_{\tau} r(\tau) \nabla_\theta p_\theta(\tau) = \sum_{\tau} r(\tau) p_\theta(\tau)\nabla_\theta \log p_\theta(\tau) = \sum_{\tau} p_\theta(\tau) \left( \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \right) r(\tau)$$When we sample a sufficient amount of data from the environment, the above quantity can be estimated via:
$$\nabla_\theta Q \approx \cfrac{1}{N}\sum_{i=1}^N \sum_t \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t}) \left(\sum_{t'=t}^{T} \gamma^{t'-t} r(s_{i,t'}, a_{i,t'}) \right)$$This is the REINFORCE algorithm, a Monte Carlo Policy Gradient algorithm with a long history. In this section, we will implement it using PyTorch.
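The inner sum $\sum_{t'=t} \gamma^{t'-t} r(s_{i,t'}, a_{i,t'})$ is the discounted reward-to-go from timestep $t$. A minimal sketch of how it is usually computed with a single backward pass (illustration only; the rewards and $\gamma$ below are made up):

```python
def reward_to_go(rewards, gamma):
    # Discounted return from each timestep onward:
    # G_t = r_t + gamma * G_{t+1}, computed by a backward pass.
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(reward_to_go([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```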
The policy network is composed of two parts:
# Run this cell without modification
class PGNetwork(nn.Module):
def __init__(self, obs_dim, act_dim, hidden_units=128):
super(PGNetwork, self).__init__()
self.network = PytorchModel(obs_dim, act_dim, hidden_units)
def forward(self, obs):
logit = self.network(obs)
# [TODO] Create an object of the class "torch.distributions.Categorical"
# with the logits. Hint: use the `logits=` keyword, not `probs=`.
# Then sample an action from it.
m = torch.distributions.Categorical(logits = logit)
action = m.sample()
return action
def log_prob(self, obs, act):
logits = self.network(obs)
# [TODO] Create an object of the class "torch.distributions.Categorical"
# Then get the log probability of the action `act` in this distribution.
m = torch.distributions.Categorical(logits = logits)
log_prob = m.log_prob(act)
return log_prob
# Note that we do not implement a GaussianPolicy here, so we can't
# apply our algorithm to environments with continuous actions.
# Solve the TODOs and remove `pass`
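As a quick sanity check of the `torch.distributions.Categorical` API used above (a standalone sketch, not part of the graded code): constructing the distribution from a batch of logits gives per-row distributions, and `log_prob` returns one log probability per sampled action.

```python
import torch

logits = torch.tensor([[2.0, 0.0], [0.0, 2.0]])   # batch of 2 states, 2 actions
m = torch.distributions.Categorical(logits=logits)
a = m.sample()                                     # shape (2,), one action per row
lp = m.log_prob(a)                                 # shape (2,), log pi(a|s) per row
print(a.shape, lp.shape)

# log_prob matches the log-softmax of the chosen logits
assert torch.allclose(
    lp, torch.log_softmax(logits, dim=-1).gather(1, a[:, None]).squeeze(1)
)
```

Passing unnormalized scores via `probs=` instead of `logits=` is a common bug: `probs` must already be a normalized distribution.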
PG_DEFAULT_CONFIG = merge_config(dict(
normalize_advantage=True,
clip_norm=10.0,
clip_gradient=True,
hidden_units=100,
max_iteration=1000,
train_batch_size=1000,
gamma=0.99,
learning_rate=0.01,
env_name="CartPole-v0",
), DEFAULT_CONFIG)
class PGTrainer(AbstractTrainer):
def __init__(self, config=None):
config = merge_config(config, PG_DEFAULT_CONFIG)
super().__init__(config)
self.iteration = 0
self.start_time = time.time()
self.iteration_time = self.start_time
self.total_timesteps = 0
self.total_episodes = 0
# build the model
self.initialize_parameters()
def initialize_parameters(self):
"""Build the policy network and related optimizer"""
# Detect whether you have GPU or not. Remember to call X.to(self.device)
# if necessary.
self.device = torch.device(
"cuda" if torch.cuda.is_available() else "cpu"
)
# Build the policy network
self.network = PGNetwork(
self.obs_dim, self.act_dim,
hidden_units=self.config["hidden_units"]
).to(self.device)
# Build the Adam optimizer.
self.optimizer = torch.optim.Adam(
self.network.parameters(),
lr=self.config["learning_rate"]
)
def to_tensor(self, array):
"""Transform a numpy array to a pytorch tensor"""
return torch.from_numpy(array).type(torch.float32).to(self.device)
def to_array(self, tensor):
"""Transform a pytorch tensor to a numpy array"""
ret = tensor.cpu().detach().numpy()
if ret.size == 1:
ret = ret.item()
return ret
def save(self, loc="model.pt"):
torch.save(self.network.state_dict(), loc)
def load(self, loc="model.pt"):
self.network.load_state_dict(torch.load(loc))
def compute_action(self, observation, eps=None):
"""Compute the action for single observation. eps is useless here."""
assert observation.ndim == 1
# [TODO] Sample an action from action distribution given by the policy
# Hint: The input of policy network is a batch of data, so you need to
# expand the first dimension of observation before feeding it to policy network.
obs = self.to_tensor(observation).unsqueeze(0)
action = self.to_array(self.network(obs))
return action
def compute_log_probs(self, observation, action):
"""Compute the log probabilities of a batch of state-action pair"""
# [TODO] Using the function of policy network to get log probs.
# Hint: Remember to transform the data into tensor before feeding it.
obs = self.to_tensor(observation)
act = self.to_tensor(action)
log_probs = self.network.log_prob(obs, act)
return log_probs
def update_network(self, processed_samples):
"""Update the policy network"""
advantages = self.to_tensor(processed_samples["advantages"])
flat_obs = np.concatenate(processed_samples["obs"])
flat_act = np.concatenate(processed_samples["act"])
self.network.train()
self.optimizer.zero_grad()
log_probs = self.compute_log_probs(flat_obs, flat_act)
assert log_probs.shape == advantages.shape, "log_probs shape {} is not " \
"compatible with advantages {}".format(
log_probs.shape, advantages.shape)
# [TODO] Compute the loss using log probabilities and advantages.
loss = (-log_probs*advantages).sum()
loss.backward()
# Clip the gradient
torch.nn.utils.clip_grad_norm_(
self.network.parameters(), self.config["clip_norm"]
)
self.optimizer.step()
self.network.eval()
update_info = {
"policy_loss": loss.item(),
"mean_log_prob": torch.mean(log_probs).item(),
"mean_advantage": torch.mean(advantages).item()
}
return update_info
# ===== Training-related functions =====
def collect_samples(self):
"""Roll out the policy in the environment and collect samples
until the configured training batch size of timesteps is reached.
"""
iter_timesteps = 0
iter_episodes = 0
episode_lens = []
episode_rewards = []
episode_obs_list = []
episode_act_list = []
episode_reward_list = []
success_list = []
while iter_timesteps <= self.config["train_batch_size"]:
obs_list, act_list, reward_list = [], [], []
obs = self.env.reset()
steps = 0
episode_reward = 0
while True:
act = self.compute_action(obs)
next_obs, reward, done, step_info = self.env.step(act)
obs_list.append(obs)
act_list.append(act)
reward_list.append(reward)
obs = next_obs.copy()
steps += 1
episode_reward += reward
if done or steps > self.config["max_episode_length"]:
if "arrive_dest" in step_info:
success_list.append(step_info["arrive_dest"])
break
iter_timesteps += steps
iter_episodes += 1
episode_rewards.append(episode_reward)
episode_lens.append(steps)
episode_obs_list.append(np.array(obs_list, dtype=np.float32))
episode_act_list.append(np.array(act_list, dtype=np.float32))
episode_reward_list.append(np.array(reward_list, dtype=np.float32))
# Understand the data structure below:
# The returned `samples` is a dict that contains several fields.
# Each field (key-value pair) contains a list.
# Each element of that list is an array holding the data of one trajectory
# (episode), i.e. the data of that field at all time steps in that episode.
# The returned `sample_info` is a dict mapping logging item names to values.
samples = {
"obs": episode_obs_list,
"act": episode_act_list,
"reward": episode_reward_list
}
sample_info = {
"iter_timesteps": iter_timesteps,
"iter_episodes": iter_episodes,
"performance": np.mean(episode_rewards), # help drawing figures
"ep_len": float(np.mean(episode_lens)),
"ep_ret": float(np.mean(episode_rewards)),
"episode_len": sum(episode_lens),
"success_rate": np.mean(success_list) if success_list else np.nan
}
return samples, sample_info
def process_samples(self, samples):
"""Process samples and add advantages in it"""
values = []
for reward_list in samples["reward"]:
# reward_list contains rewards in one episode
returns = np.zeros_like(reward_list, dtype=np.float32)
Q = 0
# [TODO] Scan the episode in reverse order and compute the
# discounted return at each time step, filling the array `returns`.
# Each entry of `returns` is the target Q value of that time step.
for i, r in reversed(list(enumerate(reward_list))):
Q = r + self.config['gamma'] * Q
returns[i] = Q
values.append(returns)
# We call the values advantage here.
advantages = np.concatenate(values)
if self.config["normalize_advantage"]:
# [TODO] Normalize the advantage so that its mean is
# approximately 0 and its standard deviation is approximately 1.
advantages = (advantages - advantages.mean()) / max(advantages.std(), 1e-6)
samples["advantages"] = advantages
return samples, {}
# ===== Training iteration =====
def train(self):
"""Here we defined the training pipeline using the abstract
functions."""
info = dict(iteration=self.iteration)
# The learning pipeline: collect samples, process them, then
# update the model.
# Collect samples
samples, sample_info = self.collect_samples()
info.update(sample_info)
# Process samples
processed_samples, processed_info = self.process_samples(samples)
info.update(processed_info)
# Update the model
update_info = self.update_network(processed_samples)
info.update(update_info)
now = time.time()
self.iteration += 1
self.total_timesteps += info.pop("iter_timesteps")
self.total_episodes += info.pop("iter_episodes")
# info["iter_time"] = now - self.iteration_time
# info["total_time"] = now - self.start_time
info["total_episodes"] = self.total_episodes
info["total_timesteps"] = self.total_timesteps
self.iteration_time = now
return info
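The reverse-scan recursion $Q_t = r_t + \gamma Q_{t+1}$ used in `process_samples` can be checked in isolation (a standalone sketch; `discounted_returns` is a hypothetical helper, not part of the trainer):

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Compute the discounted return at every time step by scanning backward."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    q = 0.0
    for i in reversed(range(len(rewards))):
        q = rewards[i] + gamma * q   # Q_t = r_t + gamma * Q_{t+1}
        returns[i] = q
    return returns

# A constant-reward episode of length 5, as in the test cell below
print(discounted_returns([2, 2, 2, 2, 2], 0.99))
# -> [9.80199, 7.880798, 5.9402, 3.98, 2.0]
```

Scanning backward makes the computation O(T) rather than the O(T^2) of summing the tail for each time step separately.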
# Run this cell without modification
# Test advantage computing
test_trainer = PGTrainer({"normalize_advantage": False})
test_trainer.train()
fake_sample = {"reward": [[2, 2, 2, 2, 2]]}
np.testing.assert_almost_equal(
test_trainer.process_samples(fake_sample)[0]["reward"][0],
fake_sample["reward"][0]
)
np.testing.assert_almost_equal(
test_trainer.process_samples(fake_sample)[0]["advantages"],
np.array([9.80199, 7.880798, 5.9402, 3.98, 2.], dtype=np.float32)
)
# Test advantage normalization
test_trainer = PGTrainer(
{"normalize_advantage": True, "env_name": "LunarLander-v2"})
test_adv = test_trainer.process_samples(fake_sample)[0]["advantages"]
np.testing.assert_almost_equal(test_adv.mean(), 0.0)
np.testing.assert_almost_equal(test_adv.std(), 1.0)
# Test the shape of functions' returns
fake_observation = np.array([
test_trainer.env.observation_space.sample() for i in range(10)
])
fake_action = np.array([
test_trainer.env.action_space.sample() for i in range(10)
])
assert test_trainer.to_tensor(fake_observation).shape == torch.Size([10, 8])
assert np.array(test_trainer.compute_action(fake_observation[0])).shape == ()
assert test_trainer.compute_log_probs(fake_observation, fake_action).shape == \
torch.Size([10])
print("Test Passed!")
Num inputs: 4, Num actions: 2 Num inputs: 8, Num actions: 4 Test Passed!
# Run this cell without modification
pg_trainer_no_na, pg_result_no_na = run(PGTrainer, dict(
learning_rate=0.01,
max_episode_length=200,
train_batch_size=200,
env_name="CartPole-v0",
normalize_advantage=False, # <<== Here!
evaluate_interval=10,
evaluate_num_episodes=10,
), 195.0)
Num inputs: 4, Num actions: 2
(0.1s,+0.1s) Iter 0, Step 209, episodic return is 21.10. {'iteration': 0.0, 'performance': 20.9, 'ep_len': 20.9, 'ep_ret': 20.9, 'episode_len': 209.0, 'policy_loss': 1702.0168, 'mean_log_prob': -0.6843, 'mean_advantage': 11.7636, 'total_episodes': 10.0, 'total_timesteps': 209.0}
(1.3s,+1.2s) Iter 10, Step 2701, episodic return is 68.30. {'iteration': 10.0, 'performance': 63.75, 'ep_len': 63.75, 'ep_ret': 63.75, 'episode_len': 255.0, 'policy_loss': 2922.1821, 'mean_log_prob': -0.409, 'mean_advantage': 27.273, 'total_episodes': 60.0, 'total_timesteps': 2701.0}
(2.5s,+1.2s) Iter 20, Step 5281, episodic return is 135.20. {'iteration': 20.0, 'performance': 126.5, 'ep_len': 126.5, 'ep_ret': 126.5, 'episode_len': 253.0, 'policy_loss': 2303.0745, 'mean_log_prob': -0.2133, 'mean_advantage': 45.3509, 'total_episodes': 87.0, 'total_timesteps': 5281.0}
(3.6s,+1.1s) Iter 30, Step 7830, episodic return is 102.30. {'iteration': 30.0, 'performance': 101.0, 'ep_len': 101.0, 'ep_ret': 101.0, 'episode_len': 202.0, 'policy_loss': 1468.9717, 'mean_log_prob': -0.1822, 'mean_advantage': 37.5016, 'total_episodes': 108.0, 'total_timesteps': 7830.0}
(4.7s,+1.1s) Iter 40, Step 10438, episodic return is 161.30. {'iteration': 40.0, 'performance': 146.0, 'ep_len': 146.0, 'ep_ret': 146.0, 'episode_len': 292.0, 'policy_loss': 2212.1445, 'mean_log_prob': -0.1678, 'mean_advantage': 48.1735, 'total_episodes': 128.0, 'total_timesteps': 10438.0}
(6.1s,+1.5s) Iter 50, Step 14094, episodic return is 170.60. {'iteration': 50.0, 'performance': 176.0, 'ep_len': 176.0, 'ep_ret': 176.0, 'episode_len': 352.0, 'policy_loss': 2646.0068, 'mean_log_prob': -0.1514, 'mean_advantage': 53.5776, 'total_episodes': 148.0, 'total_timesteps': 14094.0}
(7.4s,+1.2s) Iter 60, Step 17319, episodic return is 123.40. {'iteration': 60.0, 'performance': 136.0, 'ep_len': 136.0, 'ep_ret': 136.0, 'episode_len': 272.0, 'policy_loss': 1637.0583, 'mean_log_prob': -0.1332, 'mean_advantage': 45.7656, 'total_episodes': 168.0, 'total_timesteps': 17319.0}
(8.8s,+1.4s) Iter 70, Step 20074, episodic return is 189.70. {'iteration': 70.0, 'performance': 179.0, 'ep_len': 179.0, 'ep_ret': 179.0, 'episode_len': 358.0, 'policy_loss': 3219.593, 'mean_log_prob': -0.1785, 'mean_advantage': 54.0486, 'total_episodes': 188.0, 'total_timesteps': 20074.0}
(10.2s,+1.4s) Iter 80, Step 23622, episodic return is 161.60. {'iteration': 80.0, 'performance': 145.0, 'ep_len': 145.0, 'ep_ret': 145.0, 'episode_len': 290.0, 'policy_loss': 752.9906, 'mean_log_prob': -0.0581, 'mean_advantage': 48.1277, 'total_episodes': 208.0, 'total_timesteps': 23622.0}
(11.3s,+1.1s) Iter 90, Step 26460, episodic return is 123.50. {'iteration': 90.0, 'performance': 128.0, 'ep_len': 128.0, 'ep_ret': 128.0, 'episode_len': 256.0, 'policy_loss': 1176.2876, 'mean_log_prob': -0.1193, 'mean_advantage': 44.0614, 'total_episodes': 228.0, 'total_timesteps': 26460.0}
(12.2s,+0.9s) Iter 100, Step 28940, episodic return is 73.00. {'iteration': 100.0, 'performance': 76.3333, 'ep_len': 76.3333, 'ep_ret': 76.3333, 'episode_len': 229.0, 'policy_loss': 1176.343, 'mean_log_prob': -0.1615, 'mean_advantage': 33.4778, 'total_episodes': 255.0, 'total_timesteps': 28940.0}
(13.0s,+0.8s) Iter 110, Step 31470, episodic return is 72.00. {'iteration': 110.0, 'performance': 83.0, 'ep_len': 83.0, 'ep_ret': 83.0, 'episode_len': 249.0, 'policy_loss': 1119.4987, 'mean_log_prob': -0.1432, 'mean_advantage': 34.3346, 'total_episodes': 293.0, 'total_timesteps': 31470.0}
(13.8s,+0.8s) Iter 120, Step 33656, episodic return is 110.30. {'iteration': 120.0, 'performance': 110.5, 'ep_len': 110.5, 'ep_ret': 110.5, 'episode_len': 221.0, 'policy_loss': 797.1689, 'mean_log_prob': -0.1111, 'mean_advantage': 39.9201, 'total_episodes': 317.0, 'total_timesteps': 33656.0}
(14.7s,+0.9s) Iter 130, Step 36070, episodic return is 107.40. {'iteration': 130.0, 'performance': 84.3333, 'ep_len': 84.3333, 'ep_ret': 84.3333, 'episode_len': 253.0, 'policy_loss': 441.4778, 'mean_log_prob': -0.0549, 'mean_advantage': 33.024, 'total_episodes': 346.0, 'total_timesteps': 36070.0}
(15.8s,+1.1s) Iter 140, Step 38761, episodic return is 196.80. {'iteration': 140.0, 'performance': 150.0, 'ep_len': 150.0, 'ep_ret': 150.0, 'episode_len': 300.0, 'policy_loss': 1408.6431, 'mean_log_prob': -0.1072, 'mean_advantage': 48.6225, 'total_episodes': 367.0, 'total_timesteps': 38761.0}
In 140 iteration, episodic return 196.800 is greater than reward threshold 195.0. Congratulation! Now we exit the training process.
Environment is closed.
# Run this cell without modification
pg_trainer_na, pg_result_na = run(PGTrainer, dict(
learning_rate=0.01,
max_episode_length=200,
train_batch_size=200,
env_name="CartPole-v0",
normalize_advantage=True, # <<== Here!
evaluate_interval=10,
evaluate_num_episodes=10,
), 195.0)
Num inputs: 4, Num actions: 2
(0.2s,+0.2s) Iter 0, Step 239, episodic return is 29.20. {'iteration': 0.0, 'performance': 39.8333, 'ep_len': 39.8333, 'ep_ret': 39.8333, 'episode_len': 239.0, 'policy_loss': -0.9876, 'mean_log_prob': -0.6911, 'mean_advantage': -0.0, 'total_episodes': 6.0, 'total_timesteps': 239.0}
(1.0s,+0.8s) Iter 10, Step 2558, episodic return is 44.80. {'iteration': 10.0, 'performance': 52.0, 'ep_len': 52.0, 'ep_ret': 52.0, 'episode_len': 208.0, 'policy_loss': 1.2986, 'mean_log_prob': -0.5519, 'mean_advantage': 0.0, 'total_episodes': 58.0, 'total_timesteps': 2558.0}
(2.0s,+1.0s) Iter 20, Step 5000, episodic return is 133.00. {'iteration': 20.0, 'performance': 79.0, 'ep_len': 79.0, 'ep_ret': 79.0, 'episode_len': 237.0, 'policy_loss': -5.5536, 'mean_log_prob': -0.5298, 'mean_advantage': -0.0, 'total_episodes': 94.0, 'total_timesteps': 5000.0}
(3.1s,+1.2s) Iter 30, Step 7662, episodic return is 101.70. {'iteration': 30.0, 'performance': 87.6667, 'ep_len': 87.6667, 'ep_ret': 87.6667, 'episode_len': 263.0, 'policy_loss': -10.2336, 'mean_log_prob': -0.5189, 'mean_advantage': -0.0, 'total_episodes': 127.0, 'total_timesteps': 7662.0}
(4.7s,+1.6s) Iter 40, Step 11381, episodic return is 155.50. {'iteration': 40.0, 'performance': 166.5, 'ep_len': 166.5, 'ep_ret': 166.5, 'episode_len': 333.0, 'policy_loss': -5.3496, 'mean_log_prob': -0.4431, 'mean_advantage': -0.0, 'total_episodes': 148.0, 'total_timesteps': 11381.0}
(5.7s,+1.0s) Iter 50, Step 14089, episodic return is 133.70. {'iteration': 50.0, 'performance': 129.5, 'ep_len': 129.5, 'ep_ret': 129.5, 'episode_len': 259.0, 'policy_loss': -9.5428, 'mean_log_prob': -0.4011, 'mean_advantage': -0.0, 'total_episodes': 169.0, 'total_timesteps': 14089.0}
(7.6s,+1.9s) Iter 60, Step 17599, episodic return is 200.00. {'iteration': 60.0, 'performance': 200.0, 'ep_len': 200.0, 'ep_ret': 200.0, 'episode_len': 400.0, 'policy_loss': -9.3366, 'mean_log_prob': -0.4089, 'mean_advantage': 0.0, 'total_episodes': 189.0, 'total_timesteps': 17599.0}
In 60 iteration, episodic return 200.000 is greater than reward threshold 195.0. Congratulation! Now we exit the training process.
Environment is closed.
# Run this cell without modification
pg_result_no_na_df = pd.DataFrame(pg_result_no_na)
pg_result_na_df = pd.DataFrame(pg_result_na)
pg_result_no_na_df["normalize_advantage"] = False
pg_result_na_df["normalize_advantage"] = True
ax = sns.lineplot(
x="total_timesteps",
y="performance",
data=pd.concat([pg_result_no_na_df, pg_result_na_df]).reset_index(), hue="normalize_advantage",
)
ax.set_title("Policy Gradient: Advantage normalization matters!")
Text(0.5, 1.0, 'Policy Gradient: Advantage normalization matters!')
# Run this cell without modification
env_name = "MetaDrive-Tut-Easy-v0"
pg_trainer_metadrive_easy, pg_trainer_metadrive_easy_result = run(PGTrainer, dict(
train_batch_size=2000,
normalize_advantage=True,
max_episode_length=200,
max_iteration=5000,
evaluate_interval=10,
evaluate_num_episodes=10,
learning_rate=0.001,
clip_norm=10.0,
env_name=env_name
), reward_threshold=120)
pg_trainer_metadrive_easy.save("pg_trainer_metadrive_easy.pt")
WARNING:root:BaseEngine is not launched, fail to sync seed to engine! :task(warning): Creating implicit AsyncTaskChain default for AsyncTaskManager TaskManager
Num inputs: 259, Num actions: 9
(6.5s,+6.5s) Iter 0, Step 2010, episodic return is 2.83. {'iteration': 0.0, 'performance': 2.6934, 'ep_len': 201.0, 'ep_ret': 2.6934, 'episode_len': 2010.0, 'success_rate': 0.0, 'policy_loss': -3.9755, 'mean_log_prob': -2.1895, 'mean_advantage': 0.0, 'total_episodes': 10.0, 'total_timesteps': 2010.0}
(41.2s,+34.7s) Iter 10, Step 22520, episodic return is 7.83. {'iteration': 10.0, 'performance': 7.0545, 'ep_len': 186.6364, 'ep_ret': 7.0545, 'episode_len': 2053.0, 'success_rate': 0.0, 'policy_loss': -63.0099, 'mean_log_prob': -1.9599, 'mean_advantage': 0.0, 'total_episodes': 117.0, 'total_timesteps': 22520.0}
(79.8s,+38.6s) Iter 20, Step 43581, episodic return is 4.51. {'iteration': 20.0, 'performance': 6.7758, 'ep_len': 118.9412, 'ep_ret': 6.7758, 'episode_len': 2022.0, 'success_rate': 0.0, 'policy_loss': 23.7856, 'mean_log_prob': -1.5874, 'mean_advantage': -0.0, 'total_episodes': 261.0, 'total_timesteps': 43581.0}
(119.4s,+39.6s) Iter 30, Step 64259, episodic return is 8.89. {'iteration': 30.0, 'performance': 11.3518, 'ep_len': 137.9333, 'ep_ret': 11.3518, 'episode_len': 2069.0, 'success_rate': 0.0, 'policy_loss': 2.8028, 'mean_log_prob': -1.5688, 'mean_advantage': 0.0, 'total_episodes': 427.0, 'total_timesteps': 64259.0}
(159.2s,+39.8s) Iter 40, Step 84903, episodic return is 21.91. {'iteration': 40.0, 'performance': 17.9439, 'ep_len': 105.7895, 'ep_ret': 17.9439, 'episode_len': 2010.0, 'success_rate': 0.0, 'policy_loss': -36.0062, 'mean_log_prob': -1.5124, 'mean_advantage': -0.0, 'total_episodes': 614.0, 'total_timesteps': 84903.0}
(201.1s,+41.9s) Iter 50, Step 105383, episodic return is 52.13. {'iteration': 50.0, 'performance': 53.1934, 'ep_len': 88.9565, 'ep_ret': 53.1934, 'episode_len': 2046.0, 'success_rate': 0.0435, 'policy_loss': -77.9743, 'mean_log_prob': -1.0615, 'mean_advantage': 0.0, 'total_episodes': 832.0, 'total_timesteps': 105383.0}
(245.5s,+44.4s) Iter 60, Step 125943, episodic return is 76.89. {'iteration': 60.0, 'performance': 66.1258, 'ep_len': 81.0, 'ep_ret': 66.1258, 'episode_len': 2025.0, 'success_rate': 0.16, 'policy_loss': -23.4189, 'mean_log_prob': -0.4862, 'mean_advantage': -0.0, 'total_episodes': 1068.0, 'total_timesteps': 125943.0}
(291.0s,+45.5s) Iter 70, Step 146394, episodic return is 125.43. {'iteration': 70.0, 'performance': 110.4208, 'ep_len': 93.3636, 'ep_ret': 110.4208, 'episode_len': 2054.0, 'success_rate': 0.7727, 'policy_loss': -123.115, 'mean_log_prob': -0.1111, 'mean_advantage': -0.0, 'total_episodes': 1297.0, 'total_timesteps': 146394.0}
In 70 iteration, episodic return 125.434 is greater than reward threshold 120. Congratulation! Now we exit the training process.
Environment is closed.
# Run this cell without modification
# Render the learned behavior
# NOTE: The learned agent is marked by green color.
eval_reward, eval_info = evaluate(
policy=pg_trainer_metadrive_easy.policy,
num_episodes=1,
env_name=pg_trainer_metadrive_easy.env_name,
render="topdown", # Visualize the behaviors in top-down view
verbose=True
)
frames = [pygame.surfarray.array3d(f).swapaxes(0, 1) for f in eval_info["frames"]]
animate(frames)
print("REINFORCE agent achieves {} return in MetaDrive easy environment.".format(eval_reward))
REINFORCE agent achieves 125.58145966674864 return in MetaDrive easy environment.
(20 / 100 points)
We compute the gradient of $Q = \mathop{\mathbb E} \sum_t r(s_t, a_t)$ w.r.t. the parameters to update the policy. Consider this case: when you take a mediocre action that leads to a positive expected return, the policy gradient is also positive and the network is updated toward this action, while a potentially better action is ignored.
To tackle this problem, we introduce a "baseline" when computing the policy gradient. The insight is that we want to optimize the policy toward actions that are better than the "average action".
We introduce $b_{t} = \mathbb E_{a_t} \sum_{t'=t}{\gamma^{t'-t} r(s_{t'}, a_{t'})}$ as the baseline. It averages the expected discounted return over all possible actions at state $s_t$, so the "advantage" achieved by action $a_t$ can be evaluated via $\sum_{t'=t} \gamma^{t' -t}r(s_{t'}, a_{t'}) - b_t$
Therefore, the policy gradient becomes:
$$\nabla_\theta Q = \cfrac{1}{N}\sum_{i=1}^N \sum_t \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t}) \Big(\sum_{t'=t} \gamma^{t'-t} r(s_{i,t'}, a_{i,t'}) - b_{i, t}\Big)$$In our implementation, we estimate the baseline via an extra network self.baseline, which has the same structure as the policy network but outputs only a scalar value. We use the output of this network as the baseline, and the network itself is updated by fitting the true expected return of the current state: $\mathbb E_{a_t} \sum_{t'=t}{\gamma^{t'-t} r(s_{t'}, a_{t'})}$
The state-action values might have large variance if the reward function does, and it is not easy for a neural network to predict targets with large variance and extreme values. In the implementation, we use a trick to match the distributions of the baseline and the values. During training, we first collect a batch of target values $\{t_i = \mathbb E_{a_t} \sum_{t'=t}{\gamma^{t'-t} r(s_{t'}, a_{t'})}\}_i$, normalize all targets to a standard distribution with mean 0 and std 1, and then ask the baseline network to fit these normalized targets.
When computing the advantages, instead of using the output of the baseline network directly as the baseline $b$, we first match the baseline distribution to that of the state-action values, i.e. we "de-standardize" the baselines. The transformed baselines $b' = f(b)$ should have the same mean and standard deviation as the action values.
After that, we compute the advantage of the current action: $adv_{i,t} = \sum_{t'=t} \gamma^{t'-t} r(s_{i,t'}, a_{i,t'}) - b'_{i, t}$
By doing this, we mitigate the instability of training the baseline.
Hint: We suggest normalizing an array via (x - x.mean()) / max(x.std(), 1e-6). The max term mitigates numerical instability.
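The normalize / de-standardize transform described above can be sketched with toy numbers (the arrays below are made up for illustration):

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0])   # toy state-action values
b_raw = np.array([0.5, -0.2, 0.1])      # hypothetical raw baseline-network outputs

# 1. Standardize the raw baselines to mean 0, std 1 ...
b = (b_raw - b_raw.mean()) / max(b_raw.std(), 1e-6)

# 2. ... then rescale them to the values' statistics ("de-standardize")
b = b * values.std() + values.mean()

# The transformed baselines now share the values' mean and std
assert np.isclose(b.mean(), values.mean())
assert np.isclose(b.std(), values.std())

advantages = values - b
print(advantages)
```

The baseline network only ever fits unit-scale targets, while the advantages are still computed on the original value scale.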
class PolicyGradientWithBaselineTrainer(PGTrainer):
def initialize_parameters(self):
# Build the actor in name of self.policy
super().initialize_parameters()
# Build the baseline network using Net class.
self.baseline = PytorchModel(
self.obs_dim, 1, self.config["hidden_units"]
).to(self.device)
self.baseline_loss = nn.MSELoss()
self.baseline_optimizer = torch.optim.Adam(
self.baseline.parameters(),
lr=self.config["learning_rate"]
)
def process_samples(self, samples):
# Call the original process_samples function to get advantages
tmp_samples, _ = super().process_samples(samples)
values = tmp_samples["advantages"]
samples["values"] = values # We add q_values into samples
# [TODO] flatten the observations in all trajectories (still a numpy array)
obs = np.concatenate(samples['obs'], axis=0)
assert obs.ndim == 2
assert obs.shape[1] == self.obs_dim
obs = self.to_tensor(obs)
samples["flat_obs"] = obs
# [TODO] Compute the baseline by feeding observation to baseline network
# Hint: `baselines` is a numpy array with the same shape of `values`,
# that is: (batch size, )
baselines = self.to_array(self.baseline(obs)).reshape(-1)
assert baselines.shape == values.shape
# [TODO] Match the distribution of baselines to the values.
# Hint: We expect baselines.std() to almost equal values.std(),
# and baselines.mean() to almost equal values.mean()
baselines = (baselines - baselines.mean()) / max(baselines.std(), 1e-6)
baselines = baselines * values.std() + values.mean()
# Compute the advantage
advantages = values - baselines
samples["advantages"] = advantages
process_info = {"mean_baseline": float(np.mean(baselines))}
return samples, process_info
def update_network(self, processed_samples):
update_info = super().update_network(processed_samples)
update_info.update(self.update_baseline(processed_samples))
return update_info
def update_baseline(self, processed_samples):
self.baseline.train()
obs = processed_samples["flat_obs"]
# [TODO] Normalize the values to mean=0, std=1.
values = processed_samples["values"]
values = (values - values.mean())/max(values.std(), 1e-6)
values = self.to_tensor(values[:, np.newaxis])
baselines = self.baseline(obs)
self.baseline_optimizer.zero_grad()
loss = self.baseline_loss(input=baselines, target=values)
loss.backward()
# Clip the gradient
torch.nn.utils.clip_grad_norm_(
self.baseline.parameters(), self.config["clip_norm"]
)
self.baseline_optimizer.step()
self.baseline.eval()
return dict(baseline_loss=loss.item())
# Run this cell without modification
pg_trainer_wb_cartpole, pg_trainer_wb_cartpole_result = run(PolicyGradientWithBaselineTrainer, dict(
learning_rate=0.01,
max_episode_length=200,
train_batch_size=200,
env_name="CartPole-v0",
normalize_advantage=True,
evaluate_interval=10,
evaluate_num_episodes=10,
), 195.0)
Num inputs: 4, Num actions: 2
Num inputs: 4, Num actions: 1
(0.1s,+0.1s) Iter 0, Step 219, episodic return is 22.70. {'iteration': 0.0, 'performance': 21.9, 'ep_len': 21.9, 'ep_ret': 21.9, 'episode_len': 219.0, 'mean_baseline': 0.0, 'policy_loss': 1.0299, 'mean_log_prob': -0.6932, 'mean_advantage': -0.0, 'baseline_loss': 1.0166, 'total_episodes': 10.0, 'total_timesteps': 219.0}
/Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3441: RuntimeWarning: Mean of empty slice. out=out, **kwargs) /Users/qiqi/opt/anaconda3/envs/cs269/lib/python3.7/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars ret = ret.dtype.type(ret / rcount)
(1.4s,+1.3s) Iter 10, Step 2618, episodic return is 162.80. {'iteration': 10.0, 'performance': 124.0, 'ep_len': 124.0, 'ep_ret': 124.0, 'episode_len': 248.0, 'mean_baseline': -0.0, 'policy_loss': 7.7531, 'mean_log_prob': -0.5437, 'mean_advantage': 0.0, 'baseline_loss': 0.87, 'total_episodes': 58.0, 'total_timesteps': 2618.0}
(3.1s,+1.7s) Iter 20, Step 6338, episodic return is 193.30. {'iteration': 20.0, 'performance': 200.0, 'ep_len': 200.0, 'ep_ret': 200.0, 'episode_len': 400.0, 'mean_baseline': 0.0, 'policy_loss': -28.1118, 'mean_log_prob': -0.5119, 'mean_advantage': 0.0, 'baseline_loss': 0.9115, 'total_episodes': 78.0, 'total_timesteps': 6338.0}
(4.4s,+1.3s) Iter 30, Step 9847, episodic return is 141.10. {'iteration': 30.0, 'performance': 156.5, 'ep_len': 156.5, 'ep_ret': 156.5, 'episode_len': 313.0, 'mean_baseline': 0.0, 'policy_loss': -2.455, 'mean_log_prob': -0.509, 'mean_advantage': 0.0, 'baseline_loss': 0.1426, 'total_episodes': 98.0, 'total_timesteps': 9847.0}
(5.8s,+1.5s) Iter 40, Step 13564, episodic return is 200.00. {'iteration': 40.0, 'performance': 200.0, 'ep_len': 200.0, 'ep_ret': 200.0, 'episode_len': 400.0, 'mean_baseline': 0.0, 'policy_loss': -9.7795, 'mean_log_prob': -0.3871, 'mean_advantage': -0.0, 'baseline_loss': 1.1415, 'total_episodes': 118.0, 'total_timesteps': 13564.0}
In 40 iteration, episodic return 200.000 is greater than reward threshold 195.0. Congratulation! Now we exit the training process.
Environment is closed.
# Run this cell without modification
pg_result_no_na_df = pd.DataFrame(pg_result_no_na)
pg_result_no_na_df["normalize_advantage"] = "False"
pg_result_na_df = pd.DataFrame(pg_result_na)
pg_result_na_df["normalize_advantage"] = "True"
pg_trainer_wb_cartpole_result_df = pd.DataFrame(pg_trainer_wb_cartpole_result)
pg_trainer_wb_cartpole_result_df["normalize_advantage"] = "True with Baseline"
pg_result_df = pd.concat([pg_result_no_na_df, pg_result_na_df, pg_trainer_wb_cartpole_result_df]).reset_index()
ax = sns.lineplot(
x="total_timesteps",
y="performance",
data=pg_result_df, hue="normalize_advantage",
)
ax.set_title("Policy Gradient")
Text(0.5, 1.0, 'Policy Gradient')
# Run this cell without modification
env_name = "MetaDrive-Tut-Easy-v0"
pg_trainer_wb_metadrive_easy, pg_trainer_wb_metadrive_easy_result = run(
PolicyGradientWithBaselineTrainer,
dict(
train_batch_size=2000,
normalize_advantage=True,
max_episode_length=200,
max_iteration=5000,
evaluate_interval=10,
evaluate_num_episodes=10,
learning_rate=0.001,
clip_norm=10.0,
env_name=env_name
),
reward_threshold=120
)
pg_trainer_wb_metadrive_easy.save("pg_trainer_wb_metadrive_easy.pt")
WARNING:root:BaseEngine is not launched, fail to sync seed to engine! :task(warning): Creating implicit AsyncTaskChain default for AsyncTaskManager TaskManager
Num inputs: 259, Num actions: 9
Num inputs: 259, Num actions: 1
(6.5s,+6.5s) Iter 0, Step 2010, episodic return is 0.56. {'iteration': 0.0, 'performance': 2.5254, 'ep_len': 201.0, 'ep_ret': 2.5254, 'episode_len': 2010.0, 'success_rate': 0.0, 'mean_baseline': 0.0, 'policy_loss': -11.9632, 'mean_log_prob': -2.19, 'mean_advantage': -0.0, 'baseline_loss': 1.0011, 'total_episodes': 10.0, 'total_timesteps': 2010.0}
(44.1s,+37.6s) Iter 10, Step 23030, episodic return is 11.54. {'iteration': 10.0, 'performance': 8.3377, 'ep_len': 176.75, 'ep_ret': 8.3377, 'episode_len': 2121.0, 'success_rate': 0.0, 'mean_baseline': -0.0, 'policy_loss': -112.6892, 'mean_log_prob': -1.9327, 'mean_advantage': 0.0, 'baseline_loss': 1.0012, 'total_episodes': 119.0, 'total_timesteps': 23030.0}
(87.7s,+43.6s) Iter 20, Step 43621, episodic return is 100.49. {'iteration': 20.0, 'performance': 87.286, 'ep_len': 87.0, 'ep_ret': 87.286, 'episode_len': 2001.0, 'success_rate': 0.4348, 'mean_baseline': -0.0, 'policy_loss': -3.4383, 'mean_log_prob': -0.2343, 'mean_advantage': -0.0, 'baseline_loss': 0.9913, 'total_episodes': 306.0, 'total_timesteps': 43621.0}
(134.1s,+46.5s) Iter 30, Step 64036, episodic return is 125.58. {'iteration': 30.0, 'performance': 125.5815, 'ep_len': 98.0, 'ep_ret': 125.5815, 'episode_len': 2058.0, 'success_rate': 1.0, 'mean_baseline': -0.0, 'policy_loss': 0.0268, 'mean_log_prob': -0.0002, 'mean_advantage': -0.0, 'baseline_loss': 0.9199, 'total_episodes': 518.0, 'total_timesteps': 64036.0}
In 30 iteration, episodic return 125.581 is greater than reward threshold 120. Congratulation! Now we exit the training process.
Environment is closed.
# Run this cell without modification
# Render the learned behavior
# NOTE: The learned agent is marked by green color.
eval_reward, eval_info = evaluate(
policy=pg_trainer_wb_metadrive_easy.policy,
num_episodes=1,
env_name=pg_trainer_wb_metadrive_easy.env_name,
render="topdown", # Visualize the behaviors in top-down view
verbose=True
)
frames = [pygame.surfarray.array3d(f).swapaxes(0, 1) for f in eval_info["frames"]]
animate(frames)
print("PG agent achieves {} return in MetaDrive easy environment.".format(eval_reward))
PG agent achieves 125.58145966674864 return in MetaDrive easy environment.
# Run this cell without modification
pg_trainer_wb_metadrive_easy_result_df = pd.DataFrame(pg_trainer_wb_metadrive_easy_result)
pg_trainer_wb_metadrive_easy_result_df["with Baseline"] = True
pg_trainer_metadrive_easy_result_df = pd.DataFrame(pg_trainer_metadrive_easy_result)
pg_trainer_metadrive_easy_result_df["with Baseline"] = False
ax = sns.lineplot(
    x="total_timesteps",
    y="performance",
    data=pd.concat([
        pg_trainer_wb_metadrive_easy_result_df,
        pg_trainer_metadrive_easy_result_df
    ]).reset_index(),
    hue="with Baseline",
)
ax.set_title("Policy Gradient in MetaDrive: Baseline matters!")
Text(0.5, 1.0, 'Policy Gradient in MetaDrive: Baseline matters!')
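The gap between the two curves comes from the baseline reducing the variance of the policy-gradient estimate. A minimal sketch of the advantage computation (the function name and arrays are illustrative, not the trainer's actual code; it mirrors the `normalize_advantage=True` option used in the trainer configs):

```python
import numpy as np

def compute_advantage(returns, baseline_values, normalize=True):
    """Subtract the baseline prediction from the observed returns and
    optionally normalize, as a variance-reduction step for policy gradient."""
    advantages = np.asarray(returns, dtype=np.float64) - \
        np.asarray(baseline_values, dtype=np.float64)
    if normalize:
        # Zero mean / unit std keeps gradient magnitudes stable across batches.
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return advantages

adv = compute_advantage([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(adv.mean())  # close to 0 after normalization
```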
Goal: Achieve episodic return > 50.
A BONUS can be earned if you can improve the training performance by adjusting hyper-parameters and optimizing code. Improvement means achieving a success rate > 0.0. However, we cannot promise that it is feasible to solve this task with PG, with or without a baseline. Please create an independent markdown cell to highlight your improvement.
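For the bonus, one simple way to organize a hyper-parameter search is to enumerate candidate configs and run the trainer on each. The sketch below only builds the config dicts; the keys mirror the trainer arguments used in this notebook, while the candidate values are illustrative, not tuned recommendations:

```python
import itertools

# Candidate values to sweep over (illustrative, not tuned recommendations).
learning_rates = [0.001, 0.0005]
batch_sizes = [4000, 8000]

# Build one config dict per (learning_rate, batch_size) combination.
configs = [
    dict(
        train_batch_size=bs,
        learning_rate=lr,
        normalize_advantage=True,
        max_episode_length=1000,
        env_name="MetaDrive-Tut-Hard-v0",
    )
    for lr, bs in itertools.product(learning_rates, batch_sizes)
]
print(len(configs))  # 4 candidate configurations
```

Each config could then be passed to `run(PolicyGradientWithBaselineTrainer, config, ...)` in its own cell, so the results stay comparable.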
# Run this cell without modification
env_name = "MetaDrive-Tut-Hard-v0"
pg_trainer_wb_metadrive_hard, pg_trainer_wb_metadrive_hard_result = run(
    PolicyGradientWithBaselineTrainer,
    dict(
        train_batch_size=4000,
        normalize_advantage=True,
        max_episode_length=1000,
        max_iteration=5000,
        evaluate_interval=10,
        evaluate_num_episodes=10,
        learning_rate=0.001,
        clip_norm=10.0,
        env_name=env_name
    ),
    reward_threshold=50
)
pg_trainer_wb_metadrive_hard.save("pg_trainer_wb_metadrive_hard.pt")
WARNING:root:BaseEngine is not launched, fail to sync seed to engine! :task(warning): Creating implicit AsyncTaskChain default for AsyncTaskManager TaskManager
Num inputs: 259, Num actions: 25
Num inputs: 259, Num actions: 1
(47.3s,+47.3s) Iter 0, Step 4004, episodic return is 10.46. {'iteration': 0.0, 'performance': 13.2219, 'ep_len': 1001.0, 'ep_ret': 13.2219, 'episode_len': 4004.0, 'success_rate': 0.0, 'mean_baseline': 0.0, 'policy_loss': 3.1478, 'mean_log_prob': -3.214, 'mean_advantage': -0.0, 'baseline_loss': 1.0871, 'total_episodes': 4.0, 'total_timesteps': 4004.0}
(202.2s,+154.9s) Iter 10, Step 46736, episodic return is 14.14. {'iteration': 10.0, 'performance': 12.4421, 'ep_len': 406.4, 'ep_ret': 12.4421, 'episode_len': 4064.0, 'success_rate': 0.0, 'mean_baseline': 0.0, 'policy_loss': -40.0878, 'mean_log_prob': -3.0039, 'mean_advantage': -0.0, 'baseline_loss': 0.9919, 'total_episodes': 61.0, 'total_timesteps': 46736.0}
(372.2s,+170.0s) Iter 20, Step 89911, episodic return is 13.23. {'iteration': 20.0, 'performance': 12.4368, 'ep_len': 154.5357, 'ep_ret': 12.4368, 'episode_len': 4327.0, 'success_rate': 0.0, 'mean_baseline': 0.0, 'policy_loss': -38.0333, 'mean_log_prob': -2.371, 'mean_advantage': -0.0, 'baseline_loss': 0.9994, 'total_episodes': 225.0, 'total_timesteps': 89911.0}
(540.0s,+167.8s) Iter 30, Step 130635, episodic return is 14.32. {'iteration': 30.0, 'performance': 14.3682, 'ep_len': 103.359, 'ep_ret': 14.3682, 'episode_len': 4031.0, 'success_rate': 0.0, 'mean_baseline': -0.0, 'policy_loss': -79.0915, 'mean_log_prob': -2.1105, 'mean_advantage': 0.0, 'baseline_loss': 1.0027, 'total_episodes': 586.0, 'total_timesteps': 130635.0}
(716.0s,+176.0s) Iter 40, Step 170948, episodic return is 22.22. {'iteration': 40.0, 'performance': 13.1486, 'ep_len': 65.623, 'ep_ret': 13.1486, 'episode_len': 4003.0, 'success_rate': 0.0, 'mean_baseline': 0.0, 'policy_loss': -39.5062, 'mean_log_prob': -1.9408, 'mean_advantage': -0.0, 'baseline_loss': 1.0019, 'total_episodes': 1105.0, 'total_timesteps': 170948.0}
(895.3s,+179.3s) Iter 50, Step 211544, episodic return is 28.86. {'iteration': 50.0, 'performance': 30.3587, 'ep_len': 69.7586, 'ep_ret': 30.3587, 'episode_len': 4046.0, 'success_rate': 0.0, 'mean_baseline': 0.0, 'policy_loss': -182.8523, 'mean_log_prob': -1.4412, 'mean_advantage': 0.0, 'baseline_loss': 0.9922, 'total_episodes': 1694.0, 'total_timesteps': 211544.0}
(1079.4s,+184.1s) Iter 60, Step 251871, episodic return is 53.92. {'iteration': 60.0, 'performance': 54.7134, 'ep_len': 71.1754, 'ep_ret': 54.7134, 'episode_len': 4057.0, 'success_rate': 0.0702, 'mean_baseline': 0.0, 'policy_loss': -63.5202, 'mean_log_prob': -0.271, 'mean_advantage': 0.0, 'baseline_loss': 0.9607, 'total_episodes': 2267.0, 'total_timesteps': 251871.0}
In 60 iteration, episodic return 53.918 is greater than reward threshold 50. Congratulation! Now we exit the training process.
Environment is closed.
# Run this cell without modification
# Render the learned behavior
# NOTE: The learned agent is marked by green color.
eval_reward, eval_info = evaluate(
    policy=pg_trainer_wb_metadrive_hard.policy,
    num_episodes=1,
    env_name=pg_trainer_wb_metadrive_hard.env_name,
    render="topdown",  # Visualize the behaviors in top-down view
    verbose=True
)
frames = [pygame.surfarray.array3d(f).swapaxes(0, 1) for f in eval_info["frames"]]
animate(frames)
print("PG agent achieves {} return in MetaDrive hard environment.".format(eval_reward))
PG agent achieves 52.172567243028006 return in MetaDrive hard environment.
In this assignment, we learned how to build naive Q-learning, Deep Q-Network, and Policy Gradient methods.
In the next markdown cell, you can write whatever you like, such as suggestions for the course, confusing parts of the assignments, and so on.
If you want to do more investigation, feel free to open new cells (via Esc + B) after the next cells and write code in them, so that you can reuse the results in this notebook. Remember to write sufficient comments and documentation to let others know what you are doing.
Follow the submission instructions in the assignment to submit your work. Thank you!